I know there are other questions like this, but I'm a beginner and most of that code and those questions were quite complicated. That's why I keep it as simple as possible. I come from an R background, but recently I wanted to learn more about Java threads. I ran through several tutorials on the topic, and most of them boil down to the code I posted below. Note that the code is not doing much, and I made it quite inefficient so the threads would run for a few seconds.
The main thing to notice is that on my machine the threads run barely faster than the non-threaded run, and with low values in the for loop in the run method they are sometimes even slower. It could be because of my weak hardware (only two cores), and maybe with more cores one would see the threads outrun the non-parallel version; I don't know. What puzzles me most is that when I watch the system monitor while the program is running, both cores are used in both runs, but in the parallel version they run at nearly 100% while in the non-parallel version they run at 50-60%. Considering that both finish at about the same time, the parallel version is a lot more inefficient: it uses more computing power to do the same job no faster.
To put it in a nutshell: what am I doing wrong? I thought I wrote the program much like the one in the Java tutorial I linked below. I run Ubuntu Linux with the Sun version of Java.
http://www.java2s.com/Tutorial/Java/0160__Thread/0020__Create-Thread.htm
import java.util.ArrayList;

public class Main {
    public static void main(String[] args) {
        ArrayList<PermutateWord> words = new ArrayList<PermutateWord>();
        System.out.println(Runtime.getRuntime().availableProcessors());
        for (int i = 0; i < Runtime.getRuntime().availableProcessors(); i++) {
            words.add(new PermutateWord("Christoph"));
        }
        System.out.println("Run as thread");
        long d = System.currentTimeMillis();
        for (PermutateWord w : words) {
            w.start();
        }
        for (PermutateWord w : words) {
            try {
                w.join();
            } catch (InterruptedException e) {
                e.printStackTrace();
            }
        }
        for (PermutateWord w : words) {
            System.out.println(w.getWord());
        }
        System.out.println(((double) (System.currentTimeMillis() - d)) / 1000 + "\n");
        System.out.println("No thread");
        d = System.currentTimeMillis();
        for (PermutateWord w : words) {
            w.run();
        }
        for (PermutateWord w : words) {
            System.out.println(w.getWord());
        }
        System.out.println(((double) (System.currentTimeMillis() - d)) / 1000 + "\n");
    }
}

class PermutateWord extends Thread {
    private String word;

    public PermutateWord(String word) {
        this.word = word;
    }

    public void run() {
        java.util.Random rand = new java.util.Random();
        for (int i = 0; i < 8000000; i++) {
            word = swap(word, rand.nextInt(word.length()), rand.nextInt(word.length()));
        }
    }

    private String swap(String word2, int r1, int r2) {
        char[] wordArray = word2.toCharArray();
        char c = wordArray[r1];
        wordArray[r1] = wordArray[r2];
        wordArray[r2] = c;
        return new String(wordArray);
    }

    public String getWord() {
        return word;
    }
}
Thanks in advance
Christoph
Most of the time is spent allocating and deallocating temporary strings, which has to be synchronized. The work that can be done in parallel is trivial, and multiple threads won't give you much gain.
Math.random() also has to be synchronized. You will get better results creating a local java.util.Random for each thread:
java.util.Random rand = new java.util.Random();

public void run() {
    for (int i = 0; i < 8000000; i++) {
        word = swap(word, rand.nextInt(word.length()), rand.nextInt(word.length()));
    }
}
But you should really focus on optimizing the swap function. I'm not sure it does what you want, but I'm sure it's very inefficient. + is an expensive operation on Strings: for every +, the JVM has to allocate a new String, which is slow and doesn't work well with multiple threads. If you just want to swap two characters, consider using a char[] instead of a String. It should be much easier and much faster.
edit:
private String swap(String word2, int r1, int r2) {
    char[] wordArray = word2.toCharArray();
    char c = wordArray[r1];
    wordArray[r1] = wordArray[r2];
    wordArray[r2] = c;
    return new String(wordArray);
}
This is much better. However, you are still doing two allocations: toCharArray() and new String both allocate memory. Because the rest of your program is very simple, those two allocations take 90% of your execution time.
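To illustrate, a rough sketch of an allocation-free variant (a hypothetical class, not your original): keep the word in a char[] field and swap in place, so the hot loop allocates nothing at all.

// Hypothetical allocation-free variant: swaps happen in place on a
// char[] field, so the hot loop performs no allocations.
class PermutateWordInPlace extends Thread {
    private final char[] word;

    public PermutateWordInPlace(String word) {
        this.word = word.toCharArray(); // one allocation up front
    }

    public void run() {
        java.util.Random rand = new java.util.Random(); // one per thread
        for (int i = 0; i < 8000000; i++) {
            int r1 = rand.nextInt(word.length);
            int r2 = rand.nextInt(word.length);
            char c = word[r1];
            word[r1] = word[r2];
            word[r2] = c;
        }
    }

    public String getWord() {
        return new String(word); // one allocation at the end; call after join()
    }
}

With this shape, toCharArray() runs once in the constructor and new String once when you read the result, instead of once per loop iteration.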
I got a lot of mileage out of putting a Thread.sleep(1000) in the join loop. Empirically, java.util.Random.nextFloat() only bought me 10%. Even then, both parts ran in 16 seconds on an 8-core machine, suggesting it's serializing due to the synchronization mentioned above. But good grief, without the sleep it was running 10x slower.
Related
I recently built a Fibonacci generator that uses recursion and hashmaps to reduce complexity. I am using System.nanoTime() to keep track of the time it takes for my program to print 10000 Fibonacci numbers. It started out well at less than a second, but gradually became slower and now takes more than 4 seconds. Can someone explain why this might be happening? The code is below:
import java.util.*;
import java.math.*;

public class FibonacciGeneratorUnlimited {
    static int numFibCalls = 0;
    static HashMap<Integer, BigInteger> d = new HashMap<Integer, BigInteger>();
    static Scanner fibNumber = new Scanner(System.in);
    static BigInteger ans = new BigInteger("0");

    public static void main(String[] args) {
        d.put(0, new BigInteger("0"));
        d.put(1, new BigInteger("1"));
        System.out.print("Enter the term:\t");
        int n = fibNumber.nextInt();
        long startTime = System.nanoTime();
        for (int i = 0; i <= n; i++) {
            System.out.println(i + " : " + fib_efficient(i, d));
        }
        System.out.println((double) (System.nanoTime() - startTime) / 1000000000);
    }

    public static BigInteger fib_efficient(int n, HashMap<Integer, BigInteger> d) {
        numFibCalls += 1;
        if (d.containsKey(n)) {
            return (d.get(n));
        } else {
            ans = (fib_efficient(n - 1, d).add(fib_efficient(n - 2, d)));
            d.put(n, ans);
            return ans;
        }
    }
}
If you are restarting the program every time you make a new Fibonacci sequence, then your program most likely isn't the problem. It might just be that your processor got hot after running the program a few times, or that a background process on your computer suddenly started, causing your program to slow down.
More memory (java -Xmx...) or less caching:
public static BigInteger fib_efficient(int n, HashMap<Integer, BigInteger> d) {
    numFibCalls++;
    if ((n & 3) <= 1) { // cache two of every four values
        BigInteger cached = d.get(n);
        if (cached != null) {
            return cached;
        } else {
            BigInteger ans = fib_efficient(n - 1, d).add(fib_efficient(n - 2, d));
            d.put(n, ans);
            return ans;
        }
    } else {
        return fib_efficient(n - 1, d).add(fib_efficient(n - 2, d));
    }
}
Two subsequent numbers out of every four are cached, which is enough to stop the recursion on both branches of fib(n) = fib(n-1) + fib(n-2).
BigInteger isn't the nicest class where performance and memory are concerned.
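For comparison, here is a minimal iterative sketch (mine, not the poster's code) that avoids both the recursion and the HashMap by keeping two rolling values:

import java.math.BigInteger;

// Iterative sketch: computes fib(0)..fib(n) with two rolling values,
// no recursion and no HashMap.
public class FibIterative {
    public static void main(String[] args) {
        int n = 10000;
        BigInteger a = BigInteger.ZERO, b = BigInteger.ONE;
        long start = System.nanoTime();
        for (int i = 0; i <= n; i++) {
            System.out.println(i + " : " + a);
            BigInteger next = a.add(b); // fib(i+1) = fib(i) + fib(i-1)
            a = b;
            b = next;
        }
        System.out.println((System.nanoTime() - start) / 1e9);
    }
}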
It started out good with less than a second but gradually became slower and now it takes more than 4 seconds.
What do you mean by this? Do you mean that you ran this exact same program with the same input and its run-time changed from < 1 second to > 4 seconds?
If you have the same exact code running with the same exact inputs in a deterministic algorithm, then the differences are probably external to your code - maybe other processes are taking up more CPU on one run.
Do you mean that you increased the inputs from some value X to 10,000 and now it takes > 4 seconds?
Then that's just a matter of the algorithm taking longer with larger inputs, which is perfectly normal.
recursion and hashmaps to reduce complexity
That's not quite how complexity works. You have improved the best-case and the average-case, but you have done nothing to change the worst-case.
Now for some actual performance improvement advice
Stop printing out the results... that's eating up over 99% of your processing time. Seriously: switch out System.out.println(i + " : " + fib_efficient(i, d)) for a bare fib_efficient(i, d) and it'll execute over 100x faster. Concatenating strings and printing to the console are very expensive operations.
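A sketch of that measurement (reusing n, d, and fib_efficient from the question; nothing is printed inside the timed region):

// Time the computation alone; no printing inside the timed loop.
long startTime = System.nanoTime();
for (int i = 0; i <= n; i++) {
    fib_efficient(i, d); // results stay cached in d; print afterwards if needed
}
System.out.println((System.nanoTime() - startTime) / 1e9 + " seconds");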
It happens because the complexity of this Fibonacci computation is Big-O(n^2). This means that the larger the input, the more the time grows - quadratically, as you can see in the graph for Big-O(n^2) in this link. Check this answer to see a complete explanation of its complexity.
Now, the cost of your algorithm also increases because you are using a HashMap to search for and insert elements each time the function is invoked. Consider removing this HashMap.
I implemented a wordcount program with Java. Basically, the program takes a large file (in my tests, I used a 10 GB data file that contained numbers only) and counts the number of times each 'word' appears - in this case a number (23723, for example, might appear 243 times in the file).
Below is my implementation. I seek to improve it, with mainly performance in mind, but a few other things as well, and I am looking for some guidance. Here are a few of the issues I wish to correct:
Currently, the program is threaded and works properly. However, I pass a chunk of memory (500MB/NUM_THREADS) to each thread, and each thread proceeds to wordcount. The problem here is that the main thread waits for ALL the threads to complete before passing more data to each thread. It isn't too much of a problem, but there is a period where a few threads wait and do nothing for a while. I believe some sort of worker pool or executor service could solve this problem (I have not learned the syntax for this yet).
The program will only work for a file that contains integers. That's a problem. I struggled with this a lot, as I didn't know how to iterate through the data without creating loads of unused variables (using a String or even StringBuilder had awful performance). Currently, I use the fact that I know the input is an integer and just store the temporary variable as an int, so no memory problems there. I want to be able to use some sort of delimiter, whether that delimiter be a space or several characters.
I am using a global ConcurrentHashMap to store key-value pairs. For example, if a thread finds a number "24624", it searches for that number in the map. If it exists, it increases the value of that key by one. The value of each key at the end represents the number of occurrences of that key. So is this the proper design? Would I gain in performance by giving each thread its own hashmap, and then merging them all at the end?
Is there any other way of seeking through a file with an offset without using the class RandomAccessFile? This class will only read into a byte array, which I then have to convert. I haven't timed this conversion, but maybe it could be faster to use something else.
I am open to other possibilities as well, this is just what comes to mind.
Note: Splitting the file is not an option I want to explore, as I might be deploying this on a server in which I should not be creating my own files, but if it would really be a performance boost, I might listen.
Other Note: I am new to java threading, as well as new to StackOverflow. Be gentle.
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class BigCount2 {
    public static void main(String[] args) throws IOException, InterruptedException {
        int num, counter;
        long i, j;
        String delimiterString = " ";
        ArrayList<Character> delim = new ArrayList<Character>();
        for (char c : delimiterString.toCharArray()) {
            delim.add(c);
        }
        int counter2 = 0;
        num = Integer.parseInt(args[0]);
        int bytesToRead = 1024 * 1024 * 1024 / 2; // 500 MB, size of each outer pass
        int remainder = bytesToRead % num;
        int k = 0;
        bytesToRead = bytesToRead - remainder;
        int byr = bytesToRead / num;
        String filepath = "C:/Users/Daniel/Desktop/int-dataset-10g.dat";
        RandomAccessFile file = new RandomAccessFile(filepath, "r");
        Thread[] t = new Thread[num]; // array of threads
        ConcurrentMap<Integer, Integer> wordCountMap = new ConcurrentHashMap<Integer, Integer>(25000);
        byte[] byteArray = new byte[byr]; // one 500MB/num buffer, refilled for each thread's slice
        char[] newbyte;
        for (i = 0; i < file.length(); i += bytesToRead) {
            counter = 0;
            for (j = 0; j < bytesToRead; j += byr) {
                file.seek(i + j);
                file.read(byteArray, 0, byr);
                newbyte = new String(byteArray).toCharArray();
                t[counter] = new Thread(
                        new BigCountThread2(counter,
                                newbyte,
                                delim,
                                wordCountMap)); // each thread gets its own slice of the data
                t[counter].start();
                counter++;
                newbyte = null;
            }
            for (k = 0; k < num; k++) {
                t[k].join(); // main thread continues after ALL threads have finished
            }
            counter2++;
            System.gc();
        }
        file.close();
        System.exit(0);
    }
}
class BigCountThread2 implements Runnable {
    private final ConcurrentMap<Integer, Integer> wordCountMap;
    char[] newbyte;
    private ArrayList<Character> delim;
    private int threadId; // for later use

    BigCountThread2(int tid,
                    char[] newbyte,
                    ArrayList<Character> delim,
                    ConcurrentMap<Integer, Integer> wordCountMap) {
        this.delim = delim;
        threadId = tid;
        this.wordCountMap = wordCountMap;
        this.newbyte = newbyte;
    }

    public void run() {
        int intCheck = 0;
        int counter = 0; int i = 0; Integer check; int j = 0; int temp = 0; int intbuilder = 0;
        for (i = 0; i < newbyte.length; i++) {
            intCheck = Character.getNumericValue(newbyte[i]);
            if (newbyte[i] == ' ' || intCheck == -1) { // delimiter found: record the current number
                check = wordCountMap.putIfAbsent(intbuilder, 1);
                if (check != null) { // key already present; putIfAbsent returned the old count
                    wordCountMap.put(intbuilder, wordCountMap.get(intbuilder) + 1);
                }
                intbuilder = 0;
            } else {
                intbuilder = (intbuilder * 10) + intCheck;
                counter++;
            }
        }
    }
}
Some thoughts on a little of most of these points..
.. I believe some sort of worker pool or executor service could solve this problem (I have not learned the syntax for this yet).
If all the threads take about the same time to process the same amount of data, then there really isn't that much of a "problem" here.
However, one nice thing about a Thread Pool is it allows one to rather trivially adjust some basic parameters such as number of concurrent workers. Furthermore, using an executor service and Futures can provide an additional level of abstraction; in this case it could be especially handy if each thread returned a map as the result.
The program will only work for a file that contains integers. That's a problem. I struggled with this a lot, as I didn't know how to iterate through the data without creating loads of unused variables (using a String or even StringBuilder had awful performance) ..
This sounds like an implementation issue. I would first try a StreamTokenizer (because it's already written), but if doing it manually, I would check out its source - a good bit of it can be omitted when simplifying the notion of a "token". (It uses a temporary array to build up each token.)
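For instance, a single-threaded sketch of the StreamTokenizer loop (my own illustration; assumes Java 8+ for Map.merge and takes a hypothetical file path as its argument):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.StreamTokenizer;
import java.util.HashMap;
import java.util.Map;

// Sketch: count whitespace-separated numbers with StreamTokenizer,
// which parses numbers by default (nval holds the parsed value).
public class TokenCount {
    public static void main(String[] args) throws IOException {
        Map<Integer, Integer> counts = new HashMap<>();
        try (BufferedReader in = new BufferedReader(new FileReader(args[0]))) {
            StreamTokenizer st = new StreamTokenizer(in);
            while (st.nextToken() != StreamTokenizer.TT_EOF) {
                if (st.ttype == StreamTokenizer.TT_NUMBER) {
                    counts.merge((int) st.nval, 1, Integer::sum);
                }
            }
        }
        System.out.println(counts.size() + " distinct numbers");
    }
}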
I am using a global ConcurrentHashMap to store key-value pairs. .. So is this the proper design? Would I gain in performance by giving each thread its own hashmap, and then merging them all at the end?
It would reduce locking and may increase performance to use a separate map per thread and a merge strategy. Furthermore, the current implementation is broken: wordCountMap.put(intbuilder, wordCountMap.get(intbuilder) + 1) is not atomic, so the operation might undercount. I would use a separate map simply because reducing mutable shared state makes a threaded program much easier to reason about.
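A sketch of that design (stand-in string chunks instead of the real 500 MB slices, and Java 8+ for merge): each worker counts into its own HashMap and returns it through a Future, and the main thread combines the partial maps.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch: per-task local maps (no shared state, no locking while
// counting), merged by the main thread at the end.
public class MergeCounts {
    public static void main(String[] args) throws Exception {
        String[] chunks = { "1 2 2 3", "3 3 4", "4 4 4 1" }; // stand-in data
        ExecutorService pool = Executors.newFixedThreadPool(chunks.length);
        List<Future<Map<Integer, Integer>>> parts = new ArrayList<>();
        for (String chunk : chunks) {
            Callable<Map<Integer, Integer>> task = () -> {
                Map<Integer, Integer> local = new HashMap<>();
                for (String tok : chunk.trim().split("\\s+")) {
                    local.merge(Integer.parseInt(tok), 1, Integer::sum);
                }
                return local;
            };
            parts.add(pool.submit(task));
        }
        Map<Integer, Integer> total = new HashMap<>();
        for (Future<Map<Integer, Integer>> f : parts) {
            f.get().forEach((k, v) -> total.merge(k, v, Integer::sum));
        }
        pool.shutdown();
        System.out.println(total);
    }
}

Because each local map is confined to one task until it is returned, no locking is needed during counting.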
Is there any other way of seeking through a file with an offset without using the class RandomAccessFile? This class will only read into a byte array, which I then have to convert. I haven't timed this conversion, but maybe it could be faster to use something else.
Consider using a FileReader (and BufferedReader) per thread on the same file. This avoids having to first copy the file into an array and slice it out for individual threads; while it is the same amount of total reading, it avoids soaking up so much memory. The reading done is actually not random access, but merely sequential (with a "skip") starting from different offsets - each thread still works on a mutually exclusive range.
Also, the original slicing code is broken if an integer value is "cut" in half, as each of two threads would read half the word. One workaround is to have each thread skip the first word if it is a continuation from the previous block (i.e., scan one byte sooner) and then read past the end of its range as required to complete its last word, as in the sketch below.
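A compact sketch of that boundary rule (a hypothetical class; it assumes a single-byte encoding such as ASCII so one char read equals one byte, a single space as the delimiter, and it ignores the short-skip corner case of Reader.skip for brevity). It counts into a local map, per the previous point:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical worker for the byte range [start, end): it skips a word
// that began in the previous block (that worker finishes it) and reads
// past the end of its range to complete the word it is in the middle of.
class RangeWorker implements Runnable {
    private final String path;
    private final long start, end;
    final Map<Integer, Integer> counts = new HashMap<Integer, Integer>(); // local; merged later

    RangeWorker(String path, long start, long end) {
        this.path = path;
        this.start = start;
        this.end = end;
    }

    public void run() {
        try (BufferedReader in = new BufferedReader(new FileReader(path))) {
            long pos = start;
            if (start > 0) {
                in.skip(start - 1);
                int prev = in.read(); // the byte just before our range
                if (prev != ' ' && prev != -1) {
                    int c; // range starts mid-word: the previous worker owns it
                    while ((c = in.read()) != -1 && c != ' ') pos++;
                    pos++; // the delimiter we consumed
                }
            }
            int number = 0, digits = 0, c;
            while ((c = in.read()) != -1) {
                pos++;
                if (c == ' ') {
                    if (digits > 0) count(number);
                    number = 0;
                    digits = 0;
                    if (pos >= end) break; // stop only at a word boundary
                } else {
                    number = number * 10 + (c - '0');
                    digits++;
                }
            }
            if (digits > 0) count(number); // trailing word at end of file
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    private void count(int n) {
        Integer old = counts.get(n);
        counts.put(n, old == null ? 1 : old + 1);
    }
}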
I have two versions of a program with the same purpose: to calculate how many prime numbers there are between 0 and n.
The first version uses concurrency: a Callable class "does the math", and the results are retrieved through a Future array. As many threads are created as there are processors on my computer (4).
The second version is implemented via RMI. All four servers are registered on the local host. The servers work in parallel as well, obviously.
I would expect the second version to be slower than the first, because I guessed the network would add latency while the other version would just run the program concurrently.
However, the RMI version is around twice as fast as the parallel version... Why is this happening?!
I didn't paste any code because it'd be huge, but ask for it in case you need it and I'll see what I can do...
EDIT: adding the code. I put comments in the sections where unnecessary code was omitted.
Parallel version
import java.util.concurrent.Callable;

public class taskPrimes implements Callable<Long> {
    private final long x;
    private final long y;
    private Long total = new Long(0);

    public taskPrimes(long x, long y) {
        this.x = x;
        this.y = y;
    }

    public static boolean isPrime(long n) {
        if (n <= 1) return false;
        for (long i = 2; i <= Math.sqrt(n); i++)
            if (n % i == 0) return false;
        return true;
    }

    public Long call() {
        for (long i = x; i <= y; i++) // the limits set in the constructor
            if (isPrime(i)) total++;
        return total;
    }
}
import java.util.ArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class paralelPrimes {
    public static void main(String[] args) throws Exception {
        // here some variables...
        int nTasks = Runtime.getRuntime().availableProcessors();
        ArrayList<Future<Long>> partial = new ArrayList<Future<Long>>();
        ExecutorService ept = Executors.newFixedThreadPool(nTasks); // ThreadPoolExecutor has no no-arg constructor
        for (int i = 0; i < nTasks; i++) {
            partial.add(ept.submit(new taskPrimes(x, y))); // x and y are the limits of the range
            // sliding window here
        }
        for (Future<Long> iterator : partial)
            try { total += iterator.get(); } catch (Exception e) {}
    }
}
RMI version
Server
import java.rmi.Naming;
import java.rmi.RemoteException;
import java.rmi.server.UnicastRemoteObject;

public class serverPrimes
        extends UnicastRemoteObject
        implements interfacePrimes {

    public serverPrimes() throws RemoteException {}

    @Override
    public int primes(int x, int y) throws RemoteException {
        int total = 0;
        for (int i = x; i <= y; i++)
            if (isPrime(i)) total++;
        return total;
    }

    @Override
    public boolean isPrime(int n) throws RemoteException {
        if (n <= 1) return false;
        for (int i = 2; i <= Math.sqrt(n); i++)
            if (n % i == 0) return false;
        return true;
    }

    public static void main(String[] args) throws Exception {
        interfacePrimes RemoteObject1 = new serverPrimes();
        interfacePrimes RemoteObject2 = new serverPrimes();
        interfacePrimes RemoteObject3 = new serverPrimes();
        interfacePrimes RemoteObject4 = new serverPrimes();
        Naming.bind("Server1", RemoteObject1);
        Naming.bind("Server2", RemoteObject2);
        Naming.bind("Server3", RemoteObject3);
        Naming.bind("Server4", RemoteObject4);
    }
}
Client
import java.rmi.Naming;
import java.rmi.RemoteException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

public class clientPrimes implements Runnable {
    private int x;
    private int y;
    private interfacePrimes RemoteObjectReference;
    private static AtomicInteger total = new AtomicInteger();

    public clientPrimes(int x, int y, interfacePrimes RemoteObjectReference) {
        this.x = x;
        this.y = y;
        this.RemoteObjectReference = RemoteObjectReference;
    }

    @Override
    public void run() {
        try {
            total.addAndGet(RemoteObjectReference.primes(x, y));
        } catch (RemoteException e) {}
    }

    public static void main(String[] args) throws Exception {
        // some variables here...
        int nServers = 4;
        ExecutorService e = Executors.newFixedThreadPool(nServers);
        double t = System.nanoTime();
        for (int i = 1; i <= nServers; i++) {
            e.submit(new clientPrimes(xVentana, yVentana, (interfacePrimes) Naming.lookup("//localhost/Server" + i)));
            // sliding window here
        }
        e.shutdown();
        while (!e.isTerminated()); // busy-waits; awaitTermination would be kinder to the CPU
        t = System.nanoTime() - t;
    }
}
One interesting thing to consider is that, by default, the JVM runs in client mode. This means threads won't be spread over the cores in the most aggressive way. Running the program with the -server option can influence the result, although, as mentioned, the algorithm design is crucial and the concurrent version may have bottlenecks. Given the problem, there is little chance of a bottleneck in your algorithm, but it does need to be considered.
The RMI version truly runs in parallel because each object runs in its own server, and since this tends to be a processing problem more than a communication problem, latency plays an unimportant part.
[UPDATE]
Now that I have seen your code, let's get into some more details.
You are relying on ThreadPoolExecutor and Future to perform the thread control and synchronization for you. This means (per the documentation) that your running objects will be allocated on an existing thread, and once your object finishes its computation the thread will be returned to the pool; on the other hand, the Future will check periodically whether the computation has finished so it can collect the value.
This scenario is the best fit for computation that is performed periodically, where the ThreadPool can increase performance by having the threads pre-allocated (paying the overhead of thread creation only the first time, when the threads aren't there yet).
Your implementation is correct, but it is more centered on programmer convenience (there is nothing wrong with this; I am always defending this point of view) than on system performance.
The RMI version performs differently due (mainly) to two things:
1 - You said you are running on the same machine; most OSes will recognize localhost, 127.0.0.1, or even the machine's real IP address as its own address and optimize the communication, so there is little overhead from the network here.
2 - The RMI system creates a separate thread for each server object you created (as I mentioned before), and these servers start computing as soon as they get called.
Things you should try to experiment with:
1 - Try to run your RMI version truly on a network; if you can configure it for 10 Mbps, that would be better for seeing the communication overhead (although, since it is a one-shot communication, it may have too little impact to notice; you could change your client application to call for the calculation multiple times, and then you would see the latency along the way).
2 - Try to change your parallel implementation to use Threads directly with no Future (you could use Thread.join to monitor the end of execution) and then use the -server option on the machine (although, sometimes, the JVM performs a check to see whether the machine configuration can truly be called a server and will decline to move to that profile). The main problem is that if your threads don't get to use all the computer's cores, you won't see any performance improvement. Also, try to perform the calculation many times to overcome the overhead of thread creation.
Hope that helps to elucidate the situation :)
Cheers
It depends on how your algorithms are designed for the parallel and concurrent solutions. There is no criterion by which parallel must be better than concurrent or vice versa. For example, if your concurrent solution has many synchronized blocks, that can drop your performance; in the other case, maybe the communication in your parallel algorithm is minimal, so there is no overhead on the network.
If you can get a copy of Peter Pacheco's book, it can clear up some ideas: http://www.cs.usfca.edu/~peter/ipp/
Given the details you provided, it will mostly depend on how large a range you're using, and how efficiently you distribute the work to the servers.
For instance, I'll bet that for a small range N you will probably have no speedup from distributing via RMI. In this case, the RMI (network) overhead will likely outweigh the benefit of distributing over multiple servers. When N becomes large, and with an efficient distribution algorithm, this overhead becomes more and more negligible relative to the actual computation time.
For example, assuming homogeneous servers, a relatively efficient distribution could be to tell each server to compute the primes for all the numbers n such that n % P = i, where n <= N, P is the number of servers, i is an index in the range [0, P-1] assigned to each server, and % is the modulo operation.
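A sketch of that split, with a sequential driver just to check that the shares partition the range (the isPrime test mirrors the question's):

// Sketch of the modulo distribution: worker i of P tests exactly the
// numbers n in [2, N] with n % P == i, so the expensive large numbers
// are spread evenly across workers instead of clustering in one range.
public class ModuloSplit {
    static boolean isPrime(long n) {
        if (n <= 1) return false;
        for (long i = 2; i * i <= n; i++)
            if (n % i == 0) return false;
        return true;
    }

    static long countForWorker(long N, int P, int i) {
        long count = 0;
        long start = (i < 2) ? i + P : i; // smallest n >= 2 with n % P == i
        for (long n = start; n <= N; n += P)
            if (isPrime(n)) count++;
        return count;
    }

    public static void main(String[] args) {
        long N = 1_000_000;
        int P = 4;
        long total = 0;
        for (int i = 0; i < P; i++) // in the real setup, each call goes to a different server
            total += countForWorker(N, P, i);
        System.out.println(total + " primes up to " + N);
    }
}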
I have a simple recursive method, a depth first search. On each call, it checks if it's in a leaf, otherwise it expands the current node and calls itself on the children.
I'm trying to make it parallel, but I notice the following strange (for me) problem.
I measure execution time with System.currentTimeMillis().
When I break the search into a number of subsearches and add up the execution times, I get a bigger number than for the sequential search. I measure only execution time, no communication or synchronization, etc. I would expect to get the same time when I add the times of the subtasks. This happens even if I just run one task after the other, without threads: if I break the search into some subtasks and run them sequentially, I get a bigger total time.
If I add the number of method calls for the subtasks, I get the same number as the sequential search. So, basically, in both cases I do the same number of method calls, but I get different times.
I'm guessing there's some overhead on initial method calls, or something else caused by a JVM mechanism. Any ideas what it could be?
For example, one sequential search takes around 3300 ms. If I break it into 13 tasks, it takes a total time of 3500 ms.
My method looks like this:
private static final int dfs(State state) {
    method_calls++;
    if (state.isLeaf()) {
        return 1;
    }
    State[] children = state.expand();
    int result = 0;
    for (int i = 0; i < children.length; i++) {
        result += dfs(children[i]);
    }
    return result;
}
Whenever I call it, I do it like this:
for (int i = 0; i < num_tasks; i++) {
    long start = System.currentTimeMillis();
    dfs(tasks[i]);
    totalTime += (System.currentTimeMillis() - start);
}
The problem is that totalTime increases with num_tasks, and I would expect it to stay the same because the method_calls variable stays the same.
You should average the numbers over longer runs. Secondly, the precision of currentTimeMillis may not be sufficient; you can try using System.nanoTime().
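For example (reusing tasks, num_tasks, and dfs from your code; the repetition count is arbitrary):

// Repeat the whole pass several times with System.nanoTime() and
// average, to smooth out scheduler and JIT warm-up noise.
int reps = 5;
long total = 0;
for (int r = 0; r < reps; r++) {
    for (int i = 0; i < num_tasks; i++) {
        long start = System.nanoTime();
        dfs(tasks[i]);
        total += System.nanoTime() - start;
    }
}
System.out.println("average per pass: " + (total / reps / 1_000_000) + " ms");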
As in all programming languages, whenever you call a procedure or a method, you have to push the environment, initialize the new one, execute the instructions, return the value on the stack, and finally restore the previous environment. It costs a bit! Creating a thread costs even more!
I suppose that if you enlarge the search tree you will benefit from the parallelization.
Adding system clock time for several threads seems a weird idea. Either you are interested in the time until processing is complete, in which case adding doesn't make sense, or in CPU usage, in which case you should only count when the thread is actually scheduled to execute.
What probably happens is that, at least part of the time, more threads are ready to execute than the system has CPU cores, and the scheduler puts one of your threads to sleep, which causes it to take longer to complete. It makes sense that this effect is exacerbated the more threads you use. (Even if your program uses fewer threads than you have cores, other programs, such as your development environment, might.)
If you are interested in CPU usage, you might wish to query ThreadMXBean.getCurrentThreadCpuTime.
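A minimal sketch of that query (standard java.lang.management API; the busy-work loop is just a stand-in):

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

// Sketch: measure CPU time actually consumed by the current thread,
// rather than wall-clock time that includes waits for a free core.
public class CpuTimeDemo {
    public static void main(String[] args) {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        if (!bean.isCurrentThreadCpuTimeSupported()) {
            System.out.println("CPU time not supported on this JVM");
            return;
        }
        long cpuStart = bean.getCurrentThreadCpuTime(); // nanoseconds
        long wallStart = System.nanoTime();
        double x = 0;
        for (int i = 0; i < 50_000_000; i++) x += Math.sqrt(i); // busy work
        System.out.println("result " + x);
        System.out.println("cpu  ms: " + (bean.getCurrentThreadCpuTime() - cpuStart) / 1_000_000);
        System.out.println("wall ms: " + (System.nanoTime() - wallStart) / 1_000_000);
    }
}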
I'd expect to see Threads used. Something like this:
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;

public class Puzzle {
    // atomic counters: plain static fields would race when updated
    // from several pool threads at once
    static final AtomicLong totalTime = new AtomicLong();
    static final AtomicInteger method_calls = new AtomicInteger();

    /**
     * @param args
     */
    public static void main(String[] args) {
        final int num_tasks = 13;
        final State[] tasks = new State[num_tasks];
        ExecutorService threadPool = Executors.newFixedThreadPool(5);
        for (int i = 0; i < num_tasks; i++) {
            threadPool.submit(new DfsRunner(tasks[i]));
        }
        try {
            threadPool.shutdown();
            threadPool.awaitTermination(1, TimeUnit.SECONDS); // lengthen for bigger searches
        } catch (InterruptedException e) {
            System.out.println("Interrupted");
        }
        System.out.println(method_calls + " Methods in " + totalTime + "msecs");
    }

    static final int dfs(State state) {
        method_calls.incrementAndGet();
        if (state.isLeaf()) {
            return 1;
        }
        State[] children = state.expand();
        int result = 0;
        for (int i = 0; i < children.length; i++) {
            result += dfs(children[i]);
        }
        return result;
    }
}
With the runnable bit like this:
public class DfsRunner implements Runnable {
    private final State state;

    public DfsRunner(State state) {
        super();
        this.state = state;
    }

    @Override
    public void run() {
        long start = System.currentTimeMillis();
        Puzzle.dfs(state);
        Puzzle.totalTime.addAndGet(System.currentTimeMillis() - start);
    }
}
Not sure if this question should be here or on serverfault, but it's Java-related, so here it is:
I have two servers, with very similar technology:
server1 is Oracle/Sun x86 with dual x5670 CPU (2.93 GHz) (4 cores each), 12GB RAM.
server2 is Dell R610 with dual x5680 CPU (3.3 GHz) (6 cores each), 16GB RAM.
both are running Solaris x86, with the exact same configuration.
both have turbo-boost enabled, and no hyper-threading.
server2 should therefore be SLIGHTLY faster than server1.
I'm running the following short test program on the two platforms.
import java.io.*;

public class TestProgram {
    public static void main(String[] args) {
        new TestProgram();
    }

    public TestProgram() {
        try {
            PrintWriter writer = new PrintWriter(new FileOutputStream("perfs.txt", true), true);
            for (int i = 0; i < 10000; i++) {
                long t1 = System.nanoTime();
                System.out.println("0123456789qwertyuiop0123456789qwertyuiop0123456789qwertyuiop0123456789qwertyuiop");
                long t2 = System.nanoTime();
                writer.println((t2 - t1));
                //try {
                //    Thread.sleep(1);
                //}
                //catch(Exception e) {
                //    System.out.println("thread sleep exception");
                //}
            }
        } catch (Exception e) {
            e.printStackTrace(System.out);
        }
    }
}
I'm opening perfs.txt and averaging the results, I get:
server1: average = 1664 , trim 10% = 1615
server2: average = 1510 , trim 10% = 1429
which is a somewhat expected result (server2 perfs > server1 perfs).
Now I uncomment the "Thread.sleep(1)" part and test again; the results are now:
server1: average = 27598, trim 10% = 26583
server2: average = 52320, trim 10% = 39359
This time server2 perfs < server1 perfs.
That doesn't make any sense to me...
Obviously I'm looking for a way to improve server2's performance in the second case. There must be some kind of configuration difference, but I don't know which one.
The OSes are identical, and the Java versions are identical.
Could it be linked to the number of cores?
Maybe it's a BIOS setting? Although the BIOSes are different (AMI vs. Dell), the settings seem pretty similar.
I'll update the Dell's BIOS soon and retest, but I would appreciate any insight...
thanks
I would try a different test program; try running something like this.
public class Timer implements Runnable {
    private volatile int time;        // volatile: written by the timer thread,
    private volatile boolean running; // read and written by the caller's thread
    private Thread thread;

    public void startTimer() {
        time = 0;
        running = true;
        thread = new Thread(this);
        thread.start();
    }

    public int stopTimer() {
        running = false;
        return time;
    }

    public void run() {
        try {
            while (running) {
                Thread.sleep(1);
                time++;
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
That's the timer; now here's the main:
public class Main {
    public static void main(String args[]) {
        Timer timer = new Timer();
        timer.startTimer();
        for (int x = 0; x < 1000; x++)
            System.out.println("Testing!!");
        System.out.println("\n\nTime Taken: " + timer.stopTimer());
    }
}
I think this is a good way to test which system is truly running faster. Try this and let me know how it goes.
OK, I have a theory: the Thread.sleep() prevents the HotSpot compiler from kicking in. Because you have a sleep, it assumes the loop isn't "hot", i.e., that it doesn't matter too much how efficient the code in the loop is (because, after all, your sleep's only purpose could be to slow things down).
Hence, when you add a Thread.sleep() inside the loop, the other stuff in the loop also runs slower.
I wonder if it might make a difference if you have a loop inside a loop and measure the performance of the inner loop, with the Thread.sleep() only in the outer loop. In that case the compiler might optimize the inner loop (if there are enough iterations), as sketched below.
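A rough sketch of that experiment (the inner-loop work is a hypothetical stand-in; the sleep sits only in the outer loop, so the inner loop stays hot):

// Sketch: time only the inner loop; Thread.sleep stays in the outer
// loop, so the JIT still sees the inner loop as hot code worth compiling.
public class InnerLoopTiming {
    public static void main(String[] args) throws InterruptedException {
        long sum = 0;
        for (int outer = 0; outer < 100; outer++) {
            long t1 = System.nanoTime();
            for (int i = 0; i < 1_000_000; i++) {
                sum += i * 31; // stand-in for real work
            }
            long t2 = System.nanoTime();
            System.out.println("inner loop: " + (t2 - t1) / 1000 + " us");
            Thread.sleep(1); // sleep outside the timed region
        }
        System.out.println(sum); // defeat dead-code elimination
    }
}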
(This brings up a question: if this code is a test case extracted from production code, why does the production code sleep?)
I actually updated the BIOS on the Dell R610 and ensured all BIOS CPU parameters were adjusted for the best low-latency performance (no hyper-threading, etc.).
That solved it. The performance with and without Thread.sleep now makes sense, and the overall performance of the R610 in both cases is much better than the Sun's.
It appears the original BIOS did not make correct or full use of the Nehalem capabilities (while the Sun's did).
You are testing how fast the console updates. This is entirely OS- and window-dependent. If you run this in your IDE, it will be much slower than in an xterm. Even which font you use and how big your window is will make a big difference to performance. If your window is closed while you run the test, performance will improve.
Here is how I would run the same test. This test is self contained and does the analysis you need.
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.Arrays;

public class TestProgram {
    public static void main(String... args) throws IOException {
        File file = new File("out.txt");
        file.deleteOnExit();
        PrintWriter out = new PrintWriter(new FileWriter(file), true);
        int runs = 100000;
        long[] times = new long[runs];
        for (int i = -10000; i < runs; i++) {
            long t1 = System.nanoTime();
            out.println("0123456789qwertyuiop0123456789qwertyuiop0123456789qwertyuiop0123456789qwertyuiop");
            long t2 = System.nanoTime();
            if (i >= 0)
                times[i] = t2 - t1;
        }
        out.close();
        Arrays.sort(times);
        System.out.printf("Median time was %,d ns, the 90%%tile was %,d ns%n", times[times.length / 2], times[times.length * 9 / 10]);
    }
}
This prints, on a 2.6 GHz Xeon Windows Vista box:
Median time was 3,213 ns, the 90%tile was 3,981 ns