I'm writing a simple program that calculates the number Pi according to this formula. Before I elaborate on the problem, let me say that I'm testing my program (written in Java 8) on a CPU with 12 cores and 24 hardware threads. According to htop, there is no other load on the server while the tests run, so that is not a factor.
I expected near-linear speedup that only starts to tail off at a high number of threads (say >8, where it departs from the y=x line). Instead, beyond that point the execution time for the same parameters stays roughly constant regardless of the number of threads, and the speedup tops out at around 10.
Without giving too much concrete information, I would like to know how I can analyze where my program chokes. In other words, what are the must-do's when checking a parallel program's speedup?
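One first must-do is to measure the speedup curve itself: run the same fixed amount of work with 1..N threads and record the time for each thread count. A rough harness could look like this; the spin workload is only a placeholder for the real computation:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class SpeedupProbe {
    // Placeholder CPU-bound workload; returns a value so the JIT
    // cannot eliminate the loop entirely.
    static long spin(long iterations) {
        long x = 0;
        for (long i = 0; i < iterations; i++) x += i;
        return x;
    }

    public static void main(String[] args) throws Exception {
        long totalWork = 200_000_000L;
        int maxThreads = Runtime.getRuntime().availableProcessors();
        for (int n = 1; n <= maxThreads; n++) {
            ExecutorService pool = Executors.newFixedThreadPool(n);
            long perThread = totalWork / n; // split the fixed work evenly
            long start = System.nanoTime();
            Future<?>[] futures = new Future<?>[n];
            for (int i = 0; i < n; i++) {
                futures[i] = pool.submit(() -> spin(perThread));
            }
            for (Future<?> f : futures) f.get(); // wait for all workers
            double ms = (System.nanoTime() - start) / 1e6;
            System.out.printf("%2d threads: %8.1f ms%n", n, ms);
            pool.shutdown();
        }
    }
}
```

Plotting the printed times against the thread count shows exactly where the curve leaves the y=x line, which is the point worth investigating with a profiler.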
Related
I need to determine an ideal number of threads for a batch program that runs in a batch framework supporting parallel mode, such as parallel steps in Spring Batch.
As far as I know, having too many threads executing the steps of a program can hurt its performance. Several factors can cause performance degradation: context switching, contention on shared resources (locking, synchronization), and so on (are there any other factors?).
Of course, the best way to find the ideal number of threads is to test the actual program while adjusting the thread count. But in my situation, running real tests is not easy because they require many things (people, test scheduling, test data, etc.) that are too difficult for me to prepare right now. So, before the actual tests, I want to estimate a reasonable number of threads for my program as best I can.
What should I consider to determine the ideal number of threads (steps) for my program? The number of CPU cores? The number of processes on the machine my program will run on? The number of database connections?
Is there a rational way to decide, such as a formula, in a situation like this?
The most important consideration is whether your application/calculation is CPU-bound or IO-bound.
If it's IO-bound (a single thread spends most of its time waiting for external resources such as database connections, file systems, or other external sources of data), then you can assign (many) more threads than the number of available processors. Of course, how many also depends on how well the external resource scales; local file systems, for instance, probably not that much.
If it's (mostly) CPU-bound, then a thread count slightly above the number of available processors is probably best.
General Equation:
Number of Threads <= (Number of cores) / (1 - blocking factor)
Where 0 <= blocking factor < 1
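For illustration, the formula could be coded like this; the blocking factor is the fraction of time a task spends waiting (on IO, locks, etc.) rather than computing, so 0.0 means pure CPU work and 0.9 means 90% waiting. The sample factors are just examples:

```java
public class ThreadSizing {
    // Number of threads <= cores / (1 - blocking factor)
    static int idealThreads(double blockingFactor) {
        int cores = Runtime.getRuntime().availableProcessors();
        return (int) (cores / (1 - blockingFactor));
    }

    public static void main(String[] args) {
        System.out.println("CPU-bound (factor 0.0): " + idealThreads(0.0));
        System.out.println("IO-heavy  (factor 0.9): " + idealThreads(0.9));
    }
}
```

With a blocking factor of 0.0 this gives exactly one thread per core; with 0.9 it gives ten times the core count, matching the intuition that mostly-waiting threads can be heavily oversubscribed.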
Number of cores of a machine: Runtime.getRuntime().availableProcessors()
You can see the parallelism available to you by printing out ForkJoinPool.commonPool().
By default, that parallelism is the number of cores of your machine minus 1, because one is reserved for the main thread.
Source link
Time : 1:09:00
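For illustration, you can print both numbers like this. Note that the common pool's default parallelism can also be overridden with the java.util.concurrent.ForkJoinPool.common.parallelism system property, so the cores-minus-one rule only holds for the default configuration:

```java
import java.util.concurrent.ForkJoinPool;

public class ParallelismCheck {
    public static void main(String[] args) {
        int cores = Runtime.getRuntime().availableProcessors();
        // By default the parallelism is cores - 1: the calling thread
        // also participates in common-pool work.
        int parallelism = ForkJoinPool.commonPool().getParallelism();
        System.out.println("Cores: " + cores);
        System.out.println("Common pool parallelism: " + parallelism);
    }
}
```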
What should I consider to determine the ideal number of threads (steps) for my program? The number of CPU cores? The number of processes on the machine my program will run on? The number of database connections? Is there a rational way to decide, such as a formula, in a situation like this?
This is tremendously difficult to do without a lot of knowledge of the actual code you are threading. As @Erwin mentions, IO- versus CPU-bound operations are the key bits of knowledge needed before you can determine whether threading an application will result in any improvement at all. Even if you did manage to find the sweet spot for your particular hardware, you might boot on another server (or a different instance of a virtual cloud node) and see radically different performance numbers.
One thing to consider is changing the number of threads at runtime. ThreadPoolExecutor.setCorePoolSize(...) is designed to be callable while the thread pool is in operation. You could expose some JMX hooks to do this manually.
You could also have your application monitor its own or the system's CPU usage at runtime and tweak the values based on that feedback. Another option is to keep AtomicLong throughput counters and dial the thread count up and down at runtime, trying to maximize throughput. Getting that right can be tricky, however.
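A rough sketch of that idea, with illustrative names and pool sizes (not a production-ready tuner):

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

public class TunablePool {
    private final ThreadPoolExecutor pool =
            new ThreadPoolExecutor(4, 64, 60, TimeUnit.SECONDS,
                                   new LinkedBlockingQueue<>());
    private final AtomicLong completed = new AtomicLong();

    // Wrap each task so completions are counted for a throughput metric.
    void submit(Runnable task) {
        pool.execute(() -> { task.run(); completed.incrementAndGet(); });
    }

    // Could be exposed via a JMX MBean and called while the pool runs.
    void resize(int coreThreads) {
        pool.setCorePoolSize(coreThreads);
    }

    long throughput() { return completed.get(); }

    void shutdownAndWait() throws InterruptedException {
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
    }
}
```

A monitoring loop would periodically read throughput(), compute the delta since the last sample, and call resize(...) up or down depending on whether the rate improved.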
I typically try to:
make a best guess at the number of threads
instrument the application so I can determine the effects of different thread counts
allow it to be tweaked at runtime via JMX so I can see the effects
make sure the number of threads is configurable (via a system property, perhaps) so I don't have to re-release to try different thread counts
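For the last point, one simple approach is a system property with a guessed default; the property name here is just an example:

```java
public class WorkerConfig {
    // Reads -Dapp.worker.threads=N if set, otherwise falls back to a
    // best-guess default of one thread per available processor.
    static int threadCount() {
        int defaultGuess = Runtime.getRuntime().availableProcessors();
        return Integer.getInteger("app.worker.threads", defaultGuess);
    }

    public static void main(String[] args) {
        System.out.println("Using " + threadCount() + " worker threads");
    }
}
```

Then different thread counts can be tried with e.g. `java -Dapp.worker.threads=16 WorkerConfig` without rebuilding anything.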
I am making a program which tries to enumerate all the possible outcomes. The threads in the program generate more threads (a little over a thousand). I am really bad at multithreading, and I fear that the thread generation won't stop. I am using the Eclipse IDE, which has a terminate button; will this stop all the running threads, and if not, is there another way? Can the JVM handle this?
Yes. Clicking the terminate button in an eclipse launched JVM will halt that JVM, and that will stop all of the running threads (just like killing the JVM process).
As for running one thousand threads, I wouldn't recommend it... that sounds like a really slow approach (since each thread can only run a maximum of ~n/1000th of the time on n CPU cores).
Actually, there is no restriction on creating 1000 or more threads in the Java programming language. The problem, however, is that it will slow the work down.
Do you know how threads work? A thread just creates an environment, with the help of the OS, so that the application user feels the programs are running in parallel. The actual scenario is different: a computer with a single-core processor can handle only one operation at a time. The OS just interleaves operations one after another, so that we, the users, feel they are running in parallel.
For example, consider a three-threaded application where each thread has a for loop. The first thread adds numbers inside its loop and keeps the result in a variable named result1, the second multiplies numbers and keeps the result in result2, and the third subtracts numbers and keeps the result in result3.
Now, if all these threads start at the same instant and all have the same priority, the OS will issue their instructions one after another. In one instant it can send a number to add to result1; in the next, a number to multiply into result2; in the next, a number to subtract from result3; and in the next, it can add again.
That means a single-core processor cannot actually perform three computations simultaneously. It performs one, pauses the others, and proceeds that way.
I think you now understand why running 1000 threads will slow the whole process down. If performance is not an issue for the task and you just need the output, you can run 1000+ threads.
But if you need better performance, you have to consider something else. Have you heard about MapReduce? By implementing MapReduce with the Hadoop framework you can get better performance for this type of problem. However, you first have to express your problem in the MapReduce model; the framework then computes your task in parallel with the help of more than one computer.
Another option is setting priorities. In Java you can set a thread's priority, so you can give critical tasks a higher priority than simpler ones. If your problem's tasks can be divided into high- and low-priority ones, this may improve performance.
I was trying to get timing data for various Java programs. Then I had to perform some regression analysis based on this timing data. Here are the two methods I used to get the timing data:
System.currentTimeMillis(): I used this initially, but I wanted the timing data to be consistent when the same program was run multiple times. The variation was huge in this case; when two instances of the same code were executed in parallel, it was even larger. So I dropped this approach and started looking for profilers.
-XX countBytecodes flag in the HotSpot JVM: since the variation in timing data was huge, I thought of measuring the number of bytecodes executed instead. This should have given a more stable count when the same program was executed multiple times, but it also varied. When the programs were executed sequentially the variation was small, but during parallel runs of the same code it was huge. I also tried compiling with -Xint, but the results were similar.
So I am looking for a profiler that could give me the count of bytecodes executed when a piece of code runs. The count should remain constant (or with correlation close to 1) across runs of the same program. Alternatively, is there some other metric on which I could base timing data that stays nearly constant across multiple runs?
I wanted the timing data to be constant when the same program was run multiple times
That is not possible on a real machine unless it is designed as a hard real-time system, which your machine almost certainly is not.
I am looking for some profiler that could give me the count of byte codes executed when a code is executed.
Assuming you could do this, it wouldn't prove anything. You wouldn't be able to see, for example, that ++ is 90x cheaper than %, depending on the hardware you run on. You wouldn't be able to see that a branch miss on an if is up to 100x more expensive than a correctly speculated branch, or that a memory access which triggers a TLB miss can be more expensive than copying 4 KB of data.
if there could be some other metric based on which I could get timing data, which should stay almost constant across multiple runs.
You can run it many times and take the average. This hides high results/outliers and gives you a favourable idea of throughput. It can be a reproducible number for a given machine, if run for long enough.
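A rough sketch of averaging many runs, with a few warm-up iterations first so the JIT gets a chance to compile the code (the workload is only a placeholder):

```java
public class AverageTiming {
    // Placeholder workload; returns a value so the JIT can't drop it.
    static long workload() {
        long sum = 0;
        for (int i = 0; i < 1_000_000; i++) sum += i % 7;
        return sum;
    }

    public static void main(String[] args) {
        for (int i = 0; i < 5; i++) workload(); // warm-up for the JIT
        int runs = 50;
        long totalNanos = 0;
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            workload();
            totalNanos += System.nanoTime() - start;
        }
        System.out.printf("average over %d runs: %.3f ms%n",
                          runs, totalNanos / (double) runs / 1e6);
    }
}
```

For serious measurements a harness like JMH handles warm-up, dead-code elimination, and statistics far more carefully than a hand-rolled loop like this.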
My question is quite simple:
I am working on Ubuntu and I wrote a program in Java (with Eclipse IDE).
The program does not read or write anything anywhere; it just does a lot of calculation and creates many instances of home-made classes.
The output of the program is simple: it writes A, B, or C to the terminal (consider it a random process).
I must run the program repeatedly until I get A 1000000 times, and count the number of times I got B and C. I did it, and it works, but it is too slow.
For example, the output is:
"A:1000000
B:1012458
C:1458"
This is where I need your help:
I want to parallelize the program. I tried multithreading, but it was not faster! So, since each simulation is independent, I want to use multiple processes. For example, I would like to create 10 processes and ask each of them to run the program until A appears 100000 times (so 10 * 100000 = 1000000, as I want).
The problem is that I need to know the total number of B and C, and this way I get 10 values of each.
How can I do this? I tried ProcessBuilder (http://docs.oracle.com/javase/7/docs/api/java/lang/ProcessBuilder.html) but I do not understand how it works!
The only idea I have so far is to launch my program (with A up to 100000) 10 times from the terminal with the command:
"java Main & java Main & java Main & java Main & java Main & java Main & java Main & java Main & java Main & java Main"
But then I must sum the B and C occurrences MANUALLY. I am sure there is a better way to do this! I thought about creating 10 files with the values of (A), B, and C and then reading and summing all of them, but that is really a lot of work just to sum some integers, isn't it?
Thanks in advance; I'm waiting for help :D
PS: to make answering easier, assume I have a program named "prog" that takes a single int argument representing the number of A's I want to reach.
Parallelization makes sense only if you have a multicore CPU. Call Runtime.getRuntime().availableProcessors() to find out how many threads you should run.
Then, running 10 batches of 100000 repetitions is not the same as running one batch of 1000000 repetitions, since the internal state of your application changes along the way, so consider whether parallelization is applicable at all in your case.
To know the total number of A, B, and C results, just use an AtomicInteger for each, shared by all threads. On each iteration, check whether the count of A is still less than 1000000.
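A sketch of that idea; simulate() is a stand-in for the asker's random process, and the loop may overshoot 1000000 slightly because several threads can pass the check before the counter updates:

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.AtomicInteger;

public class AbcCounter {
    static final AtomicInteger a = new AtomicInteger();
    static final AtomicInteger b = new AtomicInteger();
    static final AtomicInteger c = new AtomicInteger();

    // Stand-in for the real random process that produces A, B or C.
    static char simulate() {
        double r = ThreadLocalRandom.current().nextDouble();
        return r < 0.4 ? 'A' : r < 0.8 ? 'B' : 'C';
    }

    public static void main(String[] args) throws Exception {
        int workers = Runtime.getRuntime().availableProcessors();
        Thread[] threads = new Thread[workers];
        for (int i = 0; i < workers; i++) {
            threads[i] = new Thread(() -> {
                // May overshoot 1000000 by a few: multiple threads can
                // pass this check before the counter is incremented.
                while (a.get() < 1_000_000) {
                    switch (simulate()) {
                        case 'A': a.incrementAndGet(); break;
                        case 'B': b.incrementAndGet(); break;
                        default:  c.incrementAndGet();
                    }
                }
            });
            threads[i].start();
        }
        for (Thread t : threads) t.join();
        System.out.println("A:" + a + " B:" + b + " C:" + c);
    }
}
```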
On a single machine, parallel processing is more efficient when using multiple threads as compared to multiple processes.
When running on a single-core/single-CPU system, however, parallel processing brings a small performance penalty and no performance benefit for pure calculations. Yet when, for example, multiple slow IO operations are involved, multithreading may speed up the process after all.
In short: on a single machine, multi-processing will almost always be slower than multi-threading.
You could write a main class that launches your prog using one of the variants of the Runtime.exec(...) method. That way you could read each process's output stream to transmit the value it has computed back to the main program.
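A sketch of that approach using ProcessBuilder; the child class name Main and the "B:&lt;count&gt;" output format are assumptions based on the question:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

public class Launcher {
    // Parses the count out of a line like "B:1012458".
    static int parseCount(String line) {
        return Integer.parseInt(line.substring(line.indexOf(':') + 1));
    }

    public static void main(String[] args) throws Exception {
        int processes = 10;
        int totalB = 0, totalC = 0;
        Process[] children = new Process[processes];
        for (int i = 0; i < processes; i++) {
            // Each child runs until A has appeared 100000 times.
            children[i] = new ProcessBuilder("java", "Main", "100000")
                    .redirectErrorStream(true)
                    .start();
        }
        for (Process child : children) {
            try (BufferedReader out = new BufferedReader(
                    new InputStreamReader(child.getInputStream()))) {
                String line;
                while ((line = out.readLine()) != null) {
                    if (line.startsWith("B:")) totalB += parseCount(line);
                    if (line.startsWith("C:")) totalC += parseCount(line);
                }
            }
            child.waitFor();
        }
        System.out.println("B:" + totalB);
        System.out.println("C:" + totalC);
    }
}
```

This sums the B and C counts from all children automatically, so no manual addition or intermediate files are needed.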
The title, I admit, is a bit misleading, but I am confused about why this happens.
I've written a program in Java that takes an argument x and instantiates x threads to do the program's work. The machine I'm running it on has 8 cores and can handle 32 threads in parallel (each core has 4 hardware threads). When I run the program with more than 8 threads (e.g. 22), I notice that an even number of threads runs faster than an odd one (23 threads is actually slower). The performance difference is about 10%. Why would this be? Thread overhead doesn't really account for this, and I would expect that as long as I'm running <32 threads, increasing the number of threads should only make it faster.
To give you an idea of what the program is doing: it takes a 1000 * 1000 array, and each thread is assigned a portion of that array to update (round-offs/leftovers from uneven division are given to the last thread instantiated).
Is there any good reason for the odd/even thread performance difference?
Two reasons I can imagine:
The need to synchronize the memory accesses of your cores/threads. This repeatedly invalidates CPU core caches and the like, which brings performance down. Try giving the threads really disjoint tasks; don't let them work on the same region of the array. Remember that memory isn't managed in individual bytes, but in cache lines.
Hyperthreaded CPUs often don't deliver full performance per thread. They may, for example, have to share floating-point units. This doesn't matter when, say, one thread is integer-math heavy and the other is float-heavy. But having four threads that all need the floating-point units probably means waiting, switching contexts, signalling the other thread, switching contexts back, waiting again...
These are just two guesses. To get a better answer, you should have given the actual CPU you are using, the partitioning scheme you use, and a more detailed description of the computational task.
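To illustrate the "really disjoint tasks" point: give each thread its own contiguous block of rows, so different threads rarely touch the same cache lines. A sketch, with a trivial update standing in for the real one:

```java
public class RowPartition {
    static final int N = 1000;
    static final double[][] grid = new double[N][N];

    public static void main(String[] args) throws InterruptedException {
        int workers = Runtime.getRuntime().availableProcessors();
        Thread[] threads = new Thread[workers];
        for (int t = 0; t < workers; t++) {
            // Integer division spreads leftover rows across the threads;
            // each thread gets a contiguous, disjoint block of rows.
            final int from = t * N / workers;
            final int to = (t + 1) * N / workers;
            threads[t] = new Thread(() -> {
                for (int i = from; i < to; i++)
                    for (int j = 0; j < N; j++)
                        grid[i][j] += 1.0; // stand-in for the real update
            });
            threads[t].start();
        }
        for (Thread th : threads) th.join();
        System.out.println("done: " + grid[0][0] + " " + grid[N - 1][N - 1]);
    }
}
```

Because each thread writes only whole rows, no two threads share a cache line except possibly at block boundaries, which keeps cache invalidation traffic to a minimum.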