Not sure if this question should be here or on Server Fault, but it's Java-related, so here it is:
I have two servers with very similar hardware:
server1 is an Oracle/Sun x86 with dual X5670 CPUs (2.93 GHz, 4 cores each) and 12 GB RAM.
server2 is a Dell R610 with dual X5680 CPUs (3.3 GHz, 6 cores each) and 16 GB RAM.
Both are running Solaris x86 with the exact same configuration.
Both have Turbo Boost enabled and no hyper-threading.
server2 should therefore be slightly faster than server1.
I'm running the following short test program on the two platforms.
import java.io.*;

public class TestProgram {

    public static void main(String[] args) {
        new TestProgram();
    }

    public TestProgram() {
        try {
            PrintWriter writer = new PrintWriter(new FileOutputStream("perfs.txt", true), true);
            for (int i = 0; i < 10000; i++) {
                long t1 = System.nanoTime();
                System.out.println("0123456789qwertyuiop0123456789qwertyuiop0123456789qwertyuiop0123456789qwertyuiop");
                long t2 = System.nanoTime();
                writer.println(t2 - t1);
                //try {
                //    Thread.sleep(1);
                //}
                //catch (Exception e) {
                //    System.out.println("thread sleep exception");
                //}
            }
        }
        catch (Exception e) {
            e.printStackTrace(System.out);
        }
    }
}
When I open perfs.txt and average the results, I get:
server1: average = 1664 , trim 10% = 1615
server2: average = 1510 , trim 10% = 1429
which is roughly the expected result (server2 faster than server1).
Now I uncomment the Thread.sleep(1) part and test again; the results are:
server1: average = 27598 , trim 10% = 26583
server2: average = 52320 , trim 10% = 39359
This time server2 is slower than server1, which doesn't make any sense to me...
Obviously I'm looking for a way to improve server2's performance in the second case. There must be some kind of configuration difference, and I don't know which one.
The OS and Java versions are identical.
Could it be linked to the number of cores?
Maybe it's a BIOS setting? Although the BIOSes are different (AMI vs. Dell), the settings seem pretty similar.
I'll update the Dell's BIOS soon and retest, but I would appreciate any insight...
Thanks
I would try a different test program; try running something like this.
public class Timer implements Runnable {

    private volatile int time;        // volatile: written by the timer thread, read by the caller
    private volatile boolean running; // volatile: written by the caller, read by the timer thread
    private Thread thread;

    public void startTimer() {
        time = 0;
        running = true;
        thread = new Thread(this);
        thread.start();
    }

    public int stopTimer() {
        running = false;
        return time;
    }

    public void run() {
        try {
            while (running) {
                Thread.sleep(1);
                time++;
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
That's the timer; now here's the main:
public class Main {

    public static void main(String[] args) {
        Timer timer = new Timer();
        timer.startTimer();
        for (int x = 0; x < 1000; x++)
            System.out.println("Testing!!");
        System.out.println("\n\nTime Taken: " + timer.stopTimer());
    }
}
I think this is a good way to test which system is truly running faster. Try this and let me know how it goes.
OK, I have a theory: the Thread.sleep() prevents the HotSpot compiler from kicking in. Because you have a sleep, it assumes the loop isn't "hot", i.e. that it doesn't matter too much how efficient the code in the loop is (because, after all, your sleep's only purpose could be to slow things down).
Hence, when you add a Thread.sleep() inside the loop, the other stuff in the loop also runs slower.
I wonder if it might make a difference if you have a loop inside a loop and measure the performance of the inner loop? (and only have the Thread.sleep() in the outer loop). In this case the compiler might optimize the inner loop (if there are enough iterations).
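For instance, a rough sketch of that experiment (all names and iteration counts made up here) could look like the following, with the sleep only between measured bursts of inner-loop work:

```java
public class NestedLoopTest {
    public static void main(String[] args) throws InterruptedException {
        long totalNanos = 0;
        int outer = 100, inner = 10_000;
        for (int i = 0; i < outer; i++) {
            long t1 = System.nanoTime();
            long sum = 0;
            for (int j = 0; j < inner; j++) {
                sum += j;                 // stand-in for the work being measured
            }
            totalNanos += System.nanoTime() - t1;
            if (sum < 0) System.out.println(sum); // keep the JIT from eliminating the loop
            Thread.sleep(1);              // the sleep lives only in the outer loop
        }
        System.out.println("avg inner-loop time: " + (totalNanos / outer) + " ns");
    }
}
```

If the theory holds, the inner loop should still get compiled (it runs enough iterations back-to-back), so its average time should not blow up the way the single flat loop with an inline sleep did.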
(Brings up a question: if this code is a test case extracted from production code, why does the production code sleep?)
I updated the BIOS on the Dell R610 and made sure all BIOS CPU parameters were adjusted for the best low-latency performance (no hyper-threading, etc.).
That solved it. The performance with and without the Thread.sleep now makes sense, and the overall performance of the R610 is much better than the Sun's in both cases.
It appears the original BIOS did not make correct or full use of the Nehalem capabilities (while the Sun's did).
You are testing how fast the console updates. This is entirely OS- and window-dependent. If you run this in your IDE it will be much slower than running in an xterm. Even which font you use and how big your window is will make a big difference to performance. If your window is closed while you run the test, performance will improve.
Here is how I would run the same test. This test is self-contained and does the analysis you need.
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.Arrays;

public class TestProgram {
    public static void main(String... args) throws IOException {
        File file = new File("out.txt");
        file.deleteOnExit();
        PrintWriter out = new PrintWriter(new FileWriter(file), true);

        int runs = 100000;
        long[] times = new long[runs];
        // negative iterations are warm-up and are discarded
        for (int i = -10000; i < runs; i++) {
            long t1 = System.nanoTime();
            out.println("0123456789qwertyuiop0123456789qwertyuiop0123456789qwertyuiop0123456789qwertyuiop");
            long t2 = System.nanoTime();
            if (i >= 0)
                times[i] = t2 - t1;
        }
        out.close();

        Arrays.sort(times);
        System.out.printf("Median time was %,d ns, the 90%%tile was %,d ns%n",
                times[times.length / 2], times[times.length * 9 / 10]);
    }
}
prints on a 2.6 GHz Xeon Windows Vista box
Median time was 3,213 ns, the 90%tile was 3,981 ns
I have two versions of a program with the same purpose: to calculate how many prime numbers there are between 0 and n.
The first version uses concurrency: a Callable class "does the math" and the results are retrieved through a Future array. There are as many threads created as there are processors on my computer (4).
The second version is implemented via RMI. All four servers are registered on the local host. The servers work in parallel as well, obviously.
I would expect the second version to be slower than the first, because I'd guess the network involves latency while the other version just runs the program concurrently.
However, the RMI version is around twice as fast as the parallel version... Why is this happening?!
I didn't paste any code because it'd be huge, but ask for it in case you need it and I'll see what I can do...
EDIT: adding the code. I commented the sections where unnecessary code would have been posted.
Parallel version
public class taskPrimes implements Callable<Long> {

    private final long x;
    private final long y;
    private Long total = new Long(0);

    public taskPrimes(long x, long y) {
        this.x = x;
        this.y = y;
    }

    public static boolean isPrime(long n) {
        if (n <= 1) return false;
        for (long i = 2; i <= Math.sqrt(n); i++)
            if (n % i == 0) return false;
        return true;
    }

    public Long call() {
        for (long i = x; i <= y; i++) // was linf/lsup, which don't exist; the fields are x and y
            if (isPrime(i)) total++;
        return total;
    }
}
public class paralelPrimes {

    public static void main(String[] args) throws Exception {
        // here some variables...
        int nTasks = Runtime.getRuntime().availableProcessors();
        ArrayList<Future<Long>> partial = new ArrayList<Future<Long>>();
        ExecutorService ept = Executors.newFixedThreadPool(nTasks); // ThreadPoolExecutor has no no-arg constructor
        for (int i = 0; i < nTasks; i++) {
            partial.add(ept.submit(new taskPrimes(x, y))); // x and y are the limits of the range
            // sliding window here
        }
        for (Future<Long> iterator : partial)
            try { total += iterator.get(); } catch (Exception e) {}
    }
}
RMI version
Server
public class serverPrimes
        extends UnicastRemoteObject
        implements interfacePrimes {

    public serverPrimes() throws RemoteException {}

    @Override
    public int primes(int x, int y) throws RemoteException {
        int total = 0;
        for (int i = x; i <= y; i++)
            if (isPrime(i)) total++;
        return total;
    }

    @Override
    public boolean isPrime(int n) throws RemoteException {
        if (n <= 1) return false;
        for (int i = 2; i <= Math.sqrt(n); i++)
            if (n % i == 0) return false;
        return true;
    }

    public static void main(String[] args) throws Exception {
        interfacePrimes RemoteObject1 = new serverPrimes();
        interfacePrimes RemoteObject2 = new serverPrimes();
        interfacePrimes RemoteObject3 = new serverPrimes();
        interfacePrimes RemoteObject4 = new serverPrimes();
        Naming.bind("Server1", RemoteObject1);
        Naming.bind("Server2", RemoteObject2);
        Naming.bind("Server3", RemoteObject3);
        Naming.bind("Server4", RemoteObject4);
    }
}
Client
public class clientPrimes implements Runnable {

    private int x;
    private int y;
    private interfacePrimes RemoteObjectReference;
    private static AtomicInteger total = new AtomicInteger();

    public clientPrimes(int x, int y, interfacePrimes RemoteObjectReference) {
        this.x = x;
        this.y = y;
        this.RemoteObjectReference = RemoteObjectReference;
    }

    @Override
    public void run() {
        try {
            total.addAndGet(RemoteObjectReference.primes(x, y));
        }
        catch (RemoteException e) {}
    }

    public static void main(String[] args) throws Exception {
        // some variables here...
        int nServers = 4;
        ExecutorService e = Executors.newFixedThreadPool(nServers);
        long t = System.nanoTime(); // long, not double: nanoTime returns a long
        for (int i = 1; i <= nServers; i++) {
            e.submit(new clientPrimes(xVentana, yVentana, (interfacePrimes) Naming.lookup("//localhost/Server" + i)));
            // sliding window here
        }
        e.shutdown();
        while (!e.isTerminated()); // busy-wait; awaitTermination would be kinder to the CPU
        t = System.nanoTime() - t;
    }
}
One interesting thing to consider is that, by default, the JVM runs in client mode. This means threads won't be spread over the cores in the most aggressive way. Trying to run the program with the -server option can influence the result, although, as mentioned, algorithm design is crucial and the concurrent version may have bottlenecks. Given the problem, there is little chance of a bottleneck in your algorithm, but it certainly needs to be considered.
The RMI version truly runs in parallel because each object runs on a thread of its own; and since this tends to be a processing problem more than a communication problem, the latency plays an unimportant part.
[UPDATE]
Now that I've seen your code, let's get into some more details.
You are relying on the ThreadPoolExecutor and Future to perform thread control and synchronization for you. This means (per the documentation) that your running objects are allocated on an existing thread, and once an object finishes its computation the thread is returned to the pool; meanwhile, the Future periodically checks whether the computation has finished so it can collect the value.
This scenario is a best fit for computations performed periodically, where the ThreadPool can increase performance by having the threads pre-allocated (paying the overhead of thread creation only the first time, when the threads aren't there yet).
Your implementation is correct, but it is more centered on programmer convenience (there is nothing wrong with this; I am always defending this point of view) than on system performance.
The RMI version performs differently mainly because of two things:
1 - You said you are running on the same machine. Most OSes will recognize localhost, 127.0.0.1, or even the machine's real IP address as being its own address and optimize the communication, so there is little network overhead here.
2 - The RMI system creates a separate thread for each server object you created (as I mentioned before), and these servers start computing as soon as they get called.
Things you should try to experiment:
1 - Try to run your RMI version over a real network; if you can configure it for 10 Mbps, that would be better for seeing the communication overhead (although, since it is a one-shot communication, it may have too little impact to notice; you could change your client application to call for the calculation multiple times, and then you'd see the latency along the way).
2 - Try to change your parallel implementation to use Threads directly with no Future (you could use Thread.join to monitor execution end), and then use the -server option on the machine (although sometimes the JVM performs a check to see if the machine configuration can truly be said to be a server, and will decline to move to that profile). The main problem is that if your threads don't get to use all the computer's cores, you won't see any performance improvement. Also try to perform the calculations many times to overcome the overhead of thread creation.
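For the second experiment, a minimal sketch (names and range split invented here, not taken from your code) of the parallel version using raw Threads plus Thread.join instead of an ExecutorService/Future might look like:

```java
import java.util.concurrent.atomic.AtomicLong;

public class RawThreadPrimes {

    static boolean isPrime(long n) {
        if (n <= 1) return false;
        for (long i = 2; i * i <= n; i++)
            if (n % i == 0) return false;
        return true;
    }

    public static void main(String[] args) throws InterruptedException {
        final long limit = 100_000;
        int nThreads = Runtime.getRuntime().availableProcessors();
        AtomicLong total = new AtomicLong();
        Thread[] workers = new Thread[nThreads];
        long chunk = limit / nThreads;
        for (int t = 0; t < nThreads; t++) {
            final long lo = t * chunk + 1;
            final long hi = (t == nThreads - 1) ? limit : lo + chunk - 1;
            workers[t] = new Thread(() -> {
                long count = 0;
                for (long n = lo; n <= hi; n++)
                    if (isPrime(n)) count++;
                total.addAndGet(count); // one shared update per thread, minimal contention
            });
            workers[t].start();
        }
        for (Thread w : workers) w.join(); // wait for all workers to finish
        System.out.println("primes up to " + limit + ": " + total.get());
    }
}
```

This removes the Future polling entirely; each worker touches shared state exactly once, so any remaining slowdown is more clearly attributable to scheduling or JIT effects.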
Hope that helps to elucidate the situation :)
Cheers
It depends on how your algorithms are designed for the parallel and concurrent solutions. There are no criteria saying parallel must be better than concurrent, or vice versa. For example, if your concurrent solution has many synchronized blocks, it can drop your performance; on the other hand, maybe the communication in your parallel algorithm is minimal, so there is no network overhead.
If you can get a copy of the book by Peter Pacheco, it can clear up some ideas: http://www.cs.usfca.edu/~peter/ipp/
Given the details you provided, it will mostly depend on how large a range you're using, and how efficiently you distribute the work to the servers.
For instance, I'll bet that for a small range N you will probably have no speedup from distributing via RMI. In this case, the RMI (network) overhead will likely outweigh the benefit of distributing over multiple servers. When N becomes large, and with an efficient distribution algorithm, this overhead will become more and more negligible with regards to the actual computation time.
For example, assuming homogeneous servers, a relatively efficient distribution could be to tell each server to compute the primes for all the numbers n such that n % P = i, where n <= N, P is the number of servers, i is an index in the range [0, P-1] assigned to each server, and % is the modulo operation.
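A toy sketch of that split (run locally here rather than over RMI, with made-up names) shows each slice is just a stride-P walk over [0, N]:

```java
public class ModuloSplit {

    static boolean isPrime(long n) {
        if (n <= 1) return false;
        for (long d = 2; d * d <= n; d++)
            if (n % d == 0) return false;
        return true;
    }

    // Count the primes in server i's slice: every n <= N with n % P == i.
    static long countSlice(long N, int P, int i) {
        long count = 0;
        for (long n = i; n <= N; n += P) // visits exactly the n with n % P == i
            if (isPrime(n)) count++;
        return count;
    }

    public static void main(String[] args) {
        long N = 1000;
        int P = 4;
        long total = 0;
        for (int i = 0; i < P; i++)      // in the real setup, each slice goes to one server
            total += countSlice(N, P, i);
        System.out.println("primes up to " + N + ": " + total); // prints: primes up to 1000: 168
    }
}
```

The interleaving also balances the load: large (slow-to-check) numbers are spread evenly across servers, unlike a contiguous block split where the last server gets all the biggest candidates.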
I have a class that creates a random string based on BigInteger. All works fine and efficiently when run standalone (Windows, 22 ms).
private SecureRandom random = new SecureRandom();

public String testMe() {
    return new BigInteger(130, random).toString(30);
}
When this code is put into a library (jar) and called from ColdFusion (9.0.2), it hangs for 1 to 1.5 minutes (on my server, Linux). This code is called from a cfc:
<cfset myTest = CreateObject("java", "com.acme.MyTest")>
<cffunction name="runTest" access="public">
<cfset var value = myTest.testMe()/>
</cffunction>
What am I missing?
I am just astonished that the difference was not noticeable on my Windows box.
There are different SecureRandom strategies. On Windows it could be using a random seed based on the host name, which can wander off to a DNS server for a reverse lookup the first time. This can time out the request after a minute or so.
I would ensure you have a recent update of Java, because I believe this is a problem that was fixed in some update of Java 6 (not to do with SecureRandom, but with the first network operation being incredibly slow).
BTW this was tested on a Windows 7 box, and the first time it hung for a few seconds, but not after that.
If your code is hanging for 60 to 90 seconds, it is not due to this method; far more likely you are performing a GC, and this method is stopping because it allocates memory.
While BigInteger is slow, SecureRandom is much, much slower. If you want this to be faster, use plain Random.
It would be slightly faster if you used fewer bits.
BTW I would use base 36 (the maximum), rather than base 30.
static volatile String dontOptimiseAway = null;

public static void testRandomBigInteger(Random random) {
    long start = System.nanoTime();
    int runs = 10000;
    for (int i = 0; i < runs; i++) {
        dontOptimiseAway = new BigInteger(130, random).toString(36);
    }
    long time = System.nanoTime() - start;
    System.out.printf("%s took %.1f micro-seconds on average%n", random.getClass().getSimpleName(), time / runs / 1e3);
}

public static void main(String... ignored) {
    for (int i = 0; i < 10; i++) {
        testRandomBigInteger(new Random());
        testRandomBigInteger(new SecureRandom());
    }
}
prints
Random took 1.7 micro-seconds on average
SecureRandom took 2.1 micro-seconds on average
The time to generate the string is significant, but still nowhere near enough to cause a multi-second delay.
I'm attempting to test some benchmarking tools by running them against a simple program which increments a variable as many times as possible for 1000 milliseconds.
How many incrementations of a single 64 bit number should I expect to be able to perform on an intel i7 chip on the JDK for Mac OS X ?
My current methodology is :
start a thread (t2) that continually increments "i" in an infinite loop (for(;;)).
let the main thread (call it t1) sleep for 1000 milliseconds.
have t1 interrupt (or stop, since this deprecated method works on Apple's JDK 6) t2.
Currently, I am reproducibly getting about 2E8 incrementations (this is tabulated below: the value shown is the value that is printed when the incrementing thread is interrupted after a 1000 millisecond sleep() in the calling thread).
217057470
223302277
212337757
215177075
214785738
213849329
215645992
215651712
215363726
216135710
How can I know whether this benchmark is reasonable or not, i.e., what is the theoretical fastest speed at which an i7 chip should be able to increment a single 64-bit number? This code runs in the JVM and is below:
package net.rudolfcode.jvm;

/**
 * How many instructions can the JVM execute in a second?
 * @author jayunit100
 */
public class Example3B {

    public static void main(String[] args) {
        for (int i = 0; i < 10; i++) {
            Thread addThread = createThread();
            runForASecond(addThread, 1000);
        }
    }

    private static Thread createThread() {
        Thread addThread = new Thread() {
            Long i = 0L;

            public void run() {
                for (;;) {
                    try {
                        i++;
                    }
                    catch (Exception e) {
                        e.printStackTrace();
                    }
                }
            }

            @Override
            public void interrupt() {
                System.out.println(i);
                super.interrupt();
            }
        };
        return addThread;
    }

    private static void runForASecond(Thread addThread, int milli) {
        addThread.start();
        try {
            Thread.sleep(milli);
        }
        catch (Exception e) {
        }
        addThread.interrupt();
        // stop() works on some JVMs...
        addThread.stop();
    }
}
Theoretically, making some assumptions which are probably not valid:
Assume that a number can be incremented in 1 instruction (probably not, because you're running in a JVM and not natively)
Assume that a 2.5 GHz processor can execute 2,500,000,000 instructions per second (but in reality, it's more complicated than that)
Then you could say that 2,500,000,000 increments in 1 second is a "reasonable" upper bound based on the simplest possible back-of-the-envelope estimation.
How far off is that from your measurement?
2,500,000,000 is O(1,000,000,000)
2E8 is O(100,000,000)
So we're only off by 1 order of magnitude. Given the wildly unfounded assumptions – sounds reasonable to me.
First of all, beware of JVM optimisations! You must be sure you measure exactly what you think you do. Since Long i = 0L; is not volatile and is effectively unused (nothing is done with the intermediate values), the JIT can do pretty nasty stuff.
As for the estimation, you can expect no more than X*10^9 operations per second on an X GHz machine. You can safely divide that value by roughly 10, because instructions aren't mapped 1:1 to clock cycles.
So you're pretty close :)
I have a simple recursive method, a depth first search. On each call, it checks if it's in a leaf, otherwise it expands the current node and calls itself on the children.
I'm trying to make it parallel, but I notice the following strange (for me) problem.
I measure execution time with System.currentTimeMillis().
When I break the search into a number of subsearches and add the total execution time, I get a bigger number than the sequential search. I only measure execution time, no communication or sync, etc. I would expect to get the same time when I add the times of the subtasks. This happens even if I just run one task after the other, so without threads. If I just break the search into some subtasks and run the subtasks one after the other, I get a bigger time.
If I add the number of method calls for the subtasks, I get the same number as the sequential search. So, basically, in both cases I do the same number of method calls, but I get different times.
I'm guessing there's some overhead on initial method calls or something else caused by a JVM mechanism. Any ideas what could it be?
For example, one sequential search takes around 3300 ms. If I break it into 13 tasks, it takes a total time of 3500ms.
My method looks like this:
private static final int dfs(State state) {
    method_calls++;
    if (state.isLeaf()) {
        return 1;
    }
    State[] children = state.expand();
    int result = 0;
    for (int i = 0; i < children.length; i++) {
        result += dfs(children[i]);
    }
    return result;
}
Whenever I call it, I do it like this:
for (int i = 0; i < num_tasks; i++) {
    long start = System.currentTimeMillis();
    dfs(tasks[i]);
    totalTime += (System.currentTimeMillis() - start);
}
The problem is that totalTime increases with num_tasks, and I would expect it to stay the same because the method_calls variable stays the same.
You should average the numbers over longer runs. Secondly, the precision of currentTimeMillis may not be sufficient; you can try using System.nanoTime().
As in all programming languages, whenever you call a procedure or method, you have to push the environment, initialize the new one, execute the program's instructions, return the value on the stack, and finally restore the previous environment. That costs a bit! And creating a thread costs even more!
I suppose that if you enlarge the search tree, you will see a benefit from the parallelization.
Adding system clock time for several threads seems a weird idea. Either you are interested in the time until processing is complete, in which case adding doesn't make sense, or in cpu usage, in which case you should only count when the thread is actually scheduled to execute.
What probably happens is that, at least part of the time, more threads are ready to execute than the system has CPU cores, and the scheduler puts one of your threads to sleep, which causes it to take longer to complete. It makes sense that this effect is exacerbated the more threads you use. (Even if your program uses fewer threads than you have cores, other programs, such as your development environment, might be competing for them.)
If you are interested in CPU usage, you might wish to query ThreadMXBean.getCurrentThreadCpuTime().
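A minimal sketch of that, assuming a JVM where per-thread CPU time measurement is supported (the workload here is a made-up busy loop):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

public class CpuTimeDemo {
    public static void main(String[] args) {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        if (!bean.isCurrentThreadCpuTimeSupported()) {
            System.out.println("CPU time measurement not supported on this JVM");
            return;
        }
        long cpuStart = bean.getCurrentThreadCpuTime(); // ns of CPU this thread has used
        long wallStart = System.nanoTime();

        long sum = 0;
        for (int i = 0; i < 50_000_000; i++) sum += i;  // some CPU-bound work

        long cpu = bean.getCurrentThreadCpuTime() - cpuStart;
        long wall = System.nanoTime() - wallStart;
        // If the scheduler preempts the thread, wall time grows but CPU time doesn't.
        System.out.printf("cpu=%d ms, wall=%d ms, sum=%d%n",
                cpu / 1_000_000, wall / 1_000_000, sum);
    }
}
```

Comparing the two numbers per thread tells you whether the extra wall-clock time you measured was real work or just time spent waiting to be scheduled.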
I'd expect to see Threads used. Something like this:
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class Puzzle {

    // note: updated from several worker threads; for exact counts use AtomicLong/AtomicInteger
    static volatile long totalTime = 0;
    private static int method_calls = 0;

    /**
     * @param args
     */
    public static void main(String[] args) {
        final int num_tasks = 13;
        final State[] tasks = new State[num_tasks];
        ExecutorService threadPool = Executors.newFixedThreadPool(5);
        for (int i = 0; i < num_tasks; i++) {
            threadPool.submit(new DfsRunner(tasks[i]));
        }
        try {
            threadPool.shutdown();
            threadPool.awaitTermination(1, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            System.out.println("Interrupted");
        }
        System.out.println(method_calls + " Methods in " + totalTime + "msecs");
    }

    static final int dfs(State state) {
        method_calls++;
        if (state.isLeaf()) {
            return 1;
        }
        State[] children = state.expand();
        int result = 0;
        for (int i = 0; i < children.length; i++) {
            result += dfs(children[i]);
        }
        return result;
    }
}
With the runnable bit like this:
public class DfsRunner implements Runnable {

    private State state;

    public DfsRunner(State state) {
        super();
        this.state = state;
    }

    @Override
    public void run() {
        long start = System.currentTimeMillis();
        Puzzle.dfs(state);
        Puzzle.totalTime += (System.currentTimeMillis() - start);
    }
}
I know there are other questions like this, but I'm a beginner and most of the code and questions were quite complicated. That's why I keep it as simple as possible. I come from an R background, but recently I wanted to learn more about Java threads. I ran through several tutorials on the topic, and most of it boils down to the code I posted below. Note the code is not doing much, and I made it quite inefficient so the threads would run a few seconds.
The main thing to notice is that on my machine the threads run not much faster than the non-threaded run; with low values in the for loop in the run method, they are even sometimes slower. It could be because of my crappy hardware (only two cores), and with more cores one might see the threads go faster than the non-parallel version. I don't know. But what puzzles me most is that when I look at the system monitor while the program is running, both cores are used in both runs, but in the parallel version they run at nearly 100% while in the non-parallel version both run at 50-60%. Considering that both finish at the same time, the parallel version is a lot more inefficient, because it uses more computing power to do the same job, and not even faster.
To put it in a nutshell: what am I doing wrong? I thought I wrote the program not much differently than in the Java tutorial. I posted the link below. I run Linux Ubuntu with the Sun version of Java.
http://www.java2s.com/Tutorial/Java/0160__Thread/0020__Create-Thread.htm
import java.util.ArrayList;

public class Main {

    public static void main(String[] args) {
        ArrayList<PermutateWord> words = new ArrayList<PermutateWord>();
        System.out.println(Runtime.getRuntime().availableProcessors());
        for (int i = 0; i < Runtime.getRuntime().availableProcessors(); i++) {
            words.add(new PermutateWord("Christoph"));
        }
        System.out.println("Run as thread");
        long d = System.currentTimeMillis();
        for (PermutateWord w : words) {
            w.start();
        }
        for (PermutateWord w : words) {
            try {
                w.join();
            } catch (InterruptedException e) {
                e.printStackTrace();
            }
        }
        for (PermutateWord w : words) {
            System.out.println(w.getWord());
        }
        System.out.println(((double) (System.currentTimeMillis() - d)) / 1000 + "\n");

        System.out.println("No thread");
        d = System.currentTimeMillis();
        for (PermutateWord w : words) {
            w.run();
        }
        for (PermutateWord w : words) {
            System.out.println(w.getWord());
        }
        System.out.println(((double) (System.currentTimeMillis() - d)) / 1000 + "\n");
    }
}

class PermutateWord extends Thread {

    private String word;

    public PermutateWord(String word) {
        this.word = word;
    }

    public void run() {
        java.util.Random rand = new java.util.Random();
        for (int i = 0; i < 8000000; i++) {
            word = swap(word, rand.nextInt(word.length()), rand.nextInt(word.length()));
        }
    }

    private String swap(String word2, int r1, int r2) {
        char[] wordArray = word2.toCharArray();
        char c = wordArray[r1];
        wordArray[r1] = wordArray[r2];
        wordArray[r2] = c;
        return new String(wordArray);
    }

    public String getWord() {
        return word;
    }
}
Thanks in advance
Christoph
Most of the time is spent allocating and deallocating temporary strings, which has to be synchronized. The work that can be done in parallel is trivial, so multiple threads won't give you much gain.
Math.random() also has to be synchronized. You will get better results creating a local java.util.Random for each thread.
java.util.Random rand = new java.util.Random();

public void run() {
    for (int i = 0; i < 8000000; i++) {
        word = swap(word, rand.nextInt(word.length()), rand.nextInt(word.length()));
    }
}
But you should really focus on optimizing the swap function. I'm not sure if it does what you want, but I'm sure it's very inefficient. + is an expensive operation on Strings: for every +, the JVM has to allocate a new String, which is slow and doesn't work well with multiple threads. If you just want to swap two characters, consider using char[] instead of String. It should be much easier and much faster.
edit:
private String swap(String word2, int r1, int r2) {
    char[] wordArray = word2.toCharArray();
    char c = wordArray[r1];
    wordArray[r1] = wordArray[r2];
    wordArray[r2] = c;
    return new String(wordArray);
}
This is much better. However, you are still doing two allocations: toCharArray() and new String both allocate memory. Because the rest of your program is very simple, those two allocations take 90% of your execution time.
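For comparison, a sketch (class and method names made up) that keeps the char[] for the whole loop and converts back to a String only once, so each iteration allocates nothing:

```java
public class InPlacePermute {

    public static String permute(String word, int iterations, java.util.Random rand) {
        char[] a = word.toCharArray();        // one allocation up front
        for (int i = 0; i < iterations; i++) {
            int r1 = rand.nextInt(a.length);
            int r2 = rand.nextInt(a.length);
            char c = a[r1];                   // swap in place, no new objects
            a[r1] = a[r2];
            a[r2] = c;
        }
        return new String(a);                 // one allocation at the end
    }

    public static void main(String[] args) {
        System.out.println(permute("Christoph", 8000000, new java.util.Random()));
    }
}
```

With the per-iteration allocations gone, the loop body is pure array arithmetic, which is exactly the kind of work that should scale across cores.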
I got a lot of mileage out of putting a Thread.sleep(1000) in the join loop. Empirically, java.util.Random.nextFloat() only bought me 10%. Even then, both parts ran in 16 seconds on an 8-core machine, suggesting it's serializing due to the synchronizations mentioned above. But good grief, without the sleep it was running 10x slower.