I need to determine an ideal number of threads for a batch program, which runs in a batch framework supporting parallel mode, like parallel steps in Spring Batch.
As far as I know, it is not good to have too many threads executing the steps of a program; it may have a negative effect on the program's performance. Several factors can cause performance degradation: context switching, race conditions when using shared resources (locking, synchronization), and so on. (Are there any other factors?)
Of course, the best way to get the ideal number of threads is to run actual tests of the program while adjusting its thread count. But in my situation, it is not that easy to run actual tests, because many things are needed for them (people, test scheduling, test data, etc.) which are too difficult for me to prepare right now. So, before running actual tests, I want to know how to estimate an ideal number of threads for my program as best I can.
What should I consider to get the ideal number of threads (steps) for my program? The number of CPU cores? The number of processes on the machine my program will run on? The number of database connections?
Is there a rational way, such as a formula, in a situation like this?
The most important consideration is whether your application/calculation is CPU-bound or IO-bound.
If it's IO-bound (a single thread spends most of its time waiting on external resources such as database connections, file systems, or other external sources of data), then you can assign (many) more threads than the number of available processors. Of course, how many also depends on how well the external resource scales; local file systems, probably not that much.
If it's (mostly) CPU-bound, then slightly over the number of available processors is probably best.
General Equation:
Number of Threads <= (Number of cores) / (1 - blocking factor)
Where 0 <= blocking factor < 1
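As a minimal sketch, the formula above can be computed directly; the blocking factor passed in is an assumed estimate that you would measure for your own workload:

```java
public class ThreadCountEstimate {
    // Estimate thread count from the blocking-factor formula:
    //   threads <= cores / (1 - blockingFactor), with 0 <= blockingFactor < 1.
    // A blocking factor of 0 means pure CPU-bound work; values near 1 mean
    // threads spend almost all of their time blocked (e.g. on I/O).
    static int idealThreads(int cores, double blockingFactor) {
        if (blockingFactor < 0 || blockingFactor >= 1) {
            throw new IllegalArgumentException("blocking factor must be in [0, 1)");
        }
        return (int) (cores / (1 - blockingFactor));
    }

    public static void main(String[] args) {
        int cores = Runtime.getRuntime().availableProcessors();
        // Assumed example: threads blocked ~50% of the time -> about 2x the cores
        System.out.println(idealThreads(cores, 0.5));
        // CPU-bound work (blocking factor 0): one thread per core
        System.out.println(idealThreads(cores, 0.0));
    }
}
```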
Number of cores of a machine: Runtime.getRuntime().availableProcessors()
The number of threads you can run in parallel you will get by printing out this code: ForkJoinPool.commonPool()
That parallelism is the number of cores of your machine minus 1, because that one core is reserved for the main thread.
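A small sketch of checking both values on your own machine (the cores-minus-one default holds on most multi-core machines, but is not guaranteed):

```java
import java.util.concurrent.ForkJoinPool;

public class ParallelismCheck {
    public static void main(String[] args) {
        int cores = Runtime.getRuntime().availableProcessors();
        int parallelism = ForkJoinPool.commonPool().getParallelism();
        System.out.println("cores = " + cores);
        System.out.println("common pool parallelism = " + parallelism);
        // On most multi-core machines the common pool defaults to cores - 1,
        // leaving one core for the calling (main) thread.
    }
}
```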
Source link
Time : 1:09:00
What should I consider to get the ideal number of threads(steps) of my program?? number of CPU cores?? number of processes on a machine on which my program would run?? number of database connection?? Is there a rational way such as a formula in a situation like this?
This is tremendously difficult to do without a lot of knowledge of the actual code that you are threading. As @Erwin mentions, IO- versus CPU-bound operations are the key bits of knowledge needed before you can determine whether threading the application will result in any improvement at all. Even if you did manage to find the sweet spot for your particular hardware, you might boot on another server (or a different instance of a virtual cloud node) and see radically different performance numbers.
One thing to consider is changing the number of threads at runtime. ThreadPoolExecutor.setCorePoolSize(...) is designed to be called after the thread pool is in operation. You could expose some JMX hooks to do this manually.
You could also have your application monitor application or system CPU usage at runtime and tweak the values based on that feedback. You could also keep AtomicLong throughput counters and dial the thread count up and down at runtime, trying to maximize the throughput. Getting that right might be tricky, however.
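A hedged sketch of dialing the pool size up at runtime; the initial and adjusted sizes are arbitrary examples of what a JMX hook might do:

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class TunablePool {
    public static void main(String[] args) {
        // Start with a pool of 4 threads (arbitrary initial guess)
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                4, 4, 60, TimeUnit.SECONDS, new LinkedBlockingQueue<>());

        // ... pool is in operation, throughput looks low ...
        // Dial the thread count up at runtime. Note: raise the maximum
        // before the core size, or setCorePoolSize will throw.
        pool.setMaximumPoolSize(8);
        pool.setCorePoolSize(8);

        System.out.println(pool.getCorePoolSize()); // 8
        pool.shutdown();
    }
}
```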
I typically try to:
make a best guess at a thread number
instrument the application so you can determine the effects of different numbers of threads
allow it to be tweaked at runtime via JMX so I can see the effects
make sure the number of threads is configurable (via a system property maybe) so you don't have to re-release to try different thread numbers
I am planning on building a Merge Sorting algorithm that uses multiple threads in Java, and I've looked around the Internet and SO (Multi-threading a merge sorting algorithm for example) but I can't seem to settle on an answer to some of my questions.
First of all, would the optimal number of threads created be the same as the number of cores of the CPU? Should I even consider logical cores when considering number of threads?
Second, what is the best way of implementing multi-threading in such an algorithm? I've heard there is more than one way of doing it (like inheriting from the "Thread" class or using implements Runnable, etc.).
Also, would using ArrayLists or LinkedLists be a better choice in this case, in terms of optimisation?
Any other notes/suggestions concerning the implementation are appreciated.
Cheers.
In Java 8, there is Arrays.parallelSort() which is also used by the Stream API if you request parallelism with parallelStream. The source to parallelSort should be pretty informative if you're looking into this for educational purposes.
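For instance (the array contents here are arbitrary):

```java
import java.util.Arrays;

public class ParallelSortDemo {
    public static void main(String[] args) {
        int[] data = { 5, 3, 9, 1, 7 };
        // Uses the ForkJoinPool common pool for large arrays; below an
        // internal threshold it falls back to a sequential sort.
        Arrays.parallelSort(data);
        System.out.println(Arrays.toString(data)); // [1, 3, 5, 7, 9]
    }
}
```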
...would the optimal number of threads created be the same as the number of cores of the CPU?
I would assume so. A merge sort should be memory-bandwidth limited, not CPU limited. The main gain from multi-threading early on is taking advantage of each core's local cache, typically the level 1 and level 2 caches. Usually the level 3 cache is shared between cores, so the only gain there is if the merge process is relatively CPU-bound compared to the speed of the level 3 cache. Once run sizes get large enough to exceed cache limits, I'm not sure there's much to be gained from multi-threading.
Microsoft's stable_sort begins by using insertion sort to create sorted groups of 32 elements, probably to take advantage of local cache. I'm not sure if that really helps or not on current processors, since it's based on code written in 1994.
Currently, I'm running on a thread-less model that isn't working simply because I'm running out of memory before I can process the data I'm being handed. I've made all the changes that I can to optimize the code, and it's still just not quite quick enough.
Clearly I should move on to a threaded model. I'm wondering what the simplest, easiest way to do the following is:
The main thread passes some info to the worker
That worker performs some work that I'll refactor out of the main method
The workers will disappear and new ones will be instantiated when needed
I've never worked with Java threading, and from what I've read it seems pretty complicated, even if what I'm looking for seems pretty simple.
If you have multiple independent units of work of equal priority, the best solution is generally some sort of work queue, where a limited number of threads (the number chosen to optimize performance) sit in a while(true) loop dequeuing work units from the queue and executing them.
Generally the optimum number of threads is going to be the number of processors +/- 1, though in some cases a larger number will be optimal if the threads tend to get stalled by disk I/O requests or some such.
But keep in mind that tuning the entire system may be required. Eg, you may need more disk arms, and certainly more RAM may be required.
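A minimal sketch of such a work queue, with a fixed number of threads looping on a BlockingQueue; the thread count of 4 and the trivial work units are placeholders to tune for your own workload:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

public class WorkQueueDemo {
    static final int N_THREADS = 4; // tune: roughly processors +/- 1 for CPU-bound work

    /** Runs n trivial work units through the queue and returns how many completed. */
    static int runJobs(int n) throws InterruptedException {
        BlockingQueue<Runnable> queue = new ArrayBlockingQueue<>(100);
        AtomicInteger done = new AtomicInteger();

        // Worker threads sit in a loop, dequeuing work units and executing them.
        for (int i = 0; i < N_THREADS; i++) {
            Thread worker = new Thread(() -> {
                while (true) {
                    try {
                        queue.take().run();
                    } catch (InterruptedException e) {
                        return; // interrupted: shut down this worker
                    }
                }
            });
            worker.setDaemon(true);
            worker.start();
        }

        // The main thread enqueues work; blocks if the queue fills up.
        for (int i = 0; i < n; i++) {
            queue.put(done::incrementAndGet);
        }
        while (done.get() < n) Thread.sleep(5); // wait for workers to drain the queue
        return done.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(runJobs(10)); // 10
    }
}
```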
I'd start by having a read through Java Concurrency as refresher ;)
In particular, I would spend some time getting to know the Executors API, as it will do most of what you've described without a lot of the overhead of dealing with too many locks ;)
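For instance, a fixed-size pool from the Executors factory covers the pattern described in the question without any explicit locking; the pool size of 4 and the squaring tasks are arbitrary examples:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ExecutorDemo {
    /** Squares 0..count-1 on a fixed pool and returns the sum of the results. */
    static int sumOfSquares(int count) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4); // pool size: arbitrary example
        List<Future<Integer>> results = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            final int n = i;
            // The main thread hands work to the pool; worker threads are
            // reused rather than instantiated per job.
            results.add(pool.submit(() -> n * n));
        }
        int sum = 0;
        for (Future<Integer> f : results) sum += f.get();
        pool.shutdown();
        return sum;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(sumOfSquares(8)); // 0+1+4+...+49 = 140
    }
}
```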
Distributing the memory consumption across multiple threads will not change overall memory consumption. From what I read in your question, I would step forward and tell you: increase the heap of the Java engine; this will help. It looks like you have to optimize the Java startup parameters, not your code. If I am wrong, then you will have to buffer the data. To disk! Not to a thread in the same memory model.
Could you please give me a real example of a latency-driven or performance-driven application? What are the differences between the two, and what do they require when designing a system in Java?
Thanks.
Examples
An example of a latency-driven Java application is a signal processor or command+control unit for a radar. The JEOPARD project recently implemented such a thing, and the AN/FPS-85 radar is another example. Both of these are Java examples, and both use an instance of Real-Time Java. The latter uses RTSJ.
Why are they "latency-driven"? Well, computations are only correct if they are delivered on time: when a computation is intended to steer a phased-array radar beam so that it impacts the predicted location of an object under track, the computation is incorrect if it is late. Therefore, there is a latency bound on the loop from the last paint of the object to the control that steers the beam onto the next predicted location.
These types of systems do have throughput requirements, but they tend not to be the driving requirements. Instead, specific latencies for specific activities must be met for correct operation, and that is the primary correctness metric.
Design techniques for these systems.
There are two common approaches: The first is basically to ignore the time requirements (latency, etc...), get the code "working" in the sense of being computationally correct, and then conduct performance engineering/optimization until the system implicitly behaves as you want. The second is to articulate clear timeliness requirements and design with those requirements in mind for each component. Given my background, I'm strongly biased toward the second path because the cost to take a random conventional development through integration and test, and tune it for the correct behavior tends to be very high and very risky. The more performance/latency-dependent the system is, the more you should ignore the rule "avoid premature optimization." It's not optimization if it's a correctness criteria. (This is not an excuse to write murky, fast code, but a bias.)
If some measure of end-to-end latency is required, a typical approach is to analyze what you expect to be the stressing conditions and develop a "latency budget", allocating portions of the latency to sequential bits of computation. As the system evolves, the budget may change around, but it becomes a useful design and test tool.
Finally, in Java, this might be manifest in three different approaches, which are really on a spectrum:
1) Just build the damn thing, and tune it once it more or less works. (Conventional design usually works this way.)
2) Build the thing, but also build in instrumentation/metrics to explicitly include latency context as work units progress through your software. A simple example of this is to timestamp arriving data and pass that timestamp along with the packet/unit as it is operated on. This is really easy for some systems, and basically impossible for others. If it's possible, it's highly recommended because then the timeliness context is explicitly available and may be used when making resource management decisions (i.e., assigning thread priorities, deadlines, queue priorities, etc...)
3) Do the analysis up-front, and use a real-time stack with formal timeliness parameters. This is the heavyweight solution, and is appropriate when you have high-criticality, safety-critical, or simply hard real-time constraints. Even if you aren't in that world, RTSJ implementations like Oracle's JavaRTS still offer benefits for soft real-time systems simply because they reduce jitter/non-determinism. There is usually a tradeoff here against raw throughput performance.
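Approach 2 above can be sketched by carrying an arrival timestamp with each work unit; the WorkUnit class and its fields are hypothetical names, not part of any framework:

```java
public class WorkUnit {
    final long arrivalNanos;   // stamped once, when the data enters the system
    final byte[] payload;      // hypothetical packet/unit contents

    WorkUnit(byte[] payload) {
        this.arrivalNanos = System.nanoTime();
        this.payload = payload;
    }

    /** Latency context available at any later processing stage. */
    long elapsedNanos() {
        return System.nanoTime() - arrivalNanos;
    }

    public static void main(String[] args) throws InterruptedException {
        WorkUnit unit = new WorkUnit(new byte[64]);
        Thread.sleep(5); // stand-in for queueing + processing delay
        // A resource manager could compare elapsed time against this stage's
        // latency budget and raise the unit's priority if it is falling behind.
        System.out.println(unit.elapsedNanos() > 0);
    }
}
```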
I have only addressed the computational side here. Obviously if your system includes or even is defined by networks, there's a whole world of latency/QoS management on that side. Common interfaces to time-sensitive Java applications there might include RSVP or perhaps specific middleware like DDS or CORBA or whatever. Probably half of the existing time-sensitive applications eschew middleware in favor of their own TCP, UDP, raw IP, or even specialized low-level solution, or build on top of a proprietary/special purpose bus.
Best Case vs. Common Case
In networking terms, throughput and latency are distinct dimensions of system performance. Throughput measures the rate (units per second) at which the system can process / transfer information. Latency measures the time (seconds) by which a computation/communication completes. Both of these can be used in common- or worst-case descriptions of performance, though it's a little hard to get your arms around "worst-case throughput" in many settings. For a specific example, consider the difference between a satellite link and a copper link over the same distance. In that setting, the satellite link has high latency (10's to 100's of milliseconds) because of speed of light time, but may also have very high bandwidth, and thus higher throughput. A single copper cable might have lower latency, but also have lower throughput (due to lower bandwidth).
In the computational setting, latency tends to be a measure of worst-case computation (though you often care about average latency, too), while throughput tends to be a measure of common-case computation rate. Examples of common latency metrics might be task-switch latency, interrupt service latency, packet service latency, etc.
Real-time or "time-critical" systems TEND to be dominated by concern for worst-case behaviors, and worst-case latencies in particular. Conventional/general-purpose systems TEND to be dominated by concern for maximum throughput. Soft real-time systems (e.g., VOIP or media) tend to manage both simultaneously, and tolerate a wider range of tradeoffs. There are corner cases like user interfaces, where perceived performance is a complicated mixture of both.
Edit to add: Some related, Java-specific SO questions. Coded using primitives only? and RTSJ implementations.
Latency is a networking term, think of it as "time to get the first byte."
Bandwidth is the other, related networking term. Think of it as "time to transfer a large block of data."
These two things are more or less independent factors. For example, Netflix sending you a Blu-ray is high latency (it takes a long time to get the first bit) but also high bandwidth (you get lots and lots of data in one fell swoop).
Performance is a higher-level concept. Performance is totally subjective; it can really only be discussed as a delta compared to another system.
Latency, bandwidth, CPU, memory, bus, disk, and of course the code itself are all a factor in dealing with performance.
I recently inherited a small Java program that takes information from a large database, does some processing and produces a detailed image regarding the information. The original author wrote the code using a single thread, then later modified it to allow it to use multiple threads.
In the code he defines a constant;
// number of threads
public static final int THREADS = Runtime.getRuntime().availableProcessors();
Which then sets the number of threads that are used to create the image.
I understand his reasoning that the number of threads should not be greater than the number of available processors, so he set it to that amount to get the full potential out of the processor(s). Is this correct? Or is there a better way to utilize the full potential of the processor(s)?
EDIT: To give some more clarification: the specific algorithm being threaded scales to the resolution of the picture being created (1 thread per pixel). That is obviously not the best solution, though. The work that this algorithm does is what takes all the time, and it is wholly mathematical operations; there are no locks or other factors that will cause any given thread to sleep. I just want to maximize the program's CPU utilization to decrease the time to completion.
Threads are fine, but as others have noted, you have to be highly aware of your bottlenecks. Your algorithm sounds like it would be susceptible to cache contention between multiple CPUs - this is particularly nasty because it has the potential to hit the performance of all of your threads (normally you think of using multiple threads to continue processing while waiting for slow or high latency IO operations).
Cache contention is a very important aspect of using multiple CPUs to process a highly parallelized algorithm: make sure that you take your memory utilization into account. If you can construct your data objects so that each thread has its own memory to work on, you can greatly reduce cache contention between the CPUs. For example, it may be easier to have one big array of ints and have different threads working on different parts of that array, but in Java the bounds checks on that array are going to be trying to access the same addresses in memory, which can cause a given CPU to have to reload data from L2 or L3 cache.
Splitting the data into its own per-thread data structures, and configuring those structures so they are thread-local, can help (it might even be more optimal to use ThreadLocal, which uses constructs in the OS that provide guarantees the CPU can use to optimize caching).
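A hypothetical sketch of the ThreadLocal approach, where each thread gets its own scratch buffer so no two threads write into the same backing array:

```java
public class PerThreadBuffers {
    // Each thread lazily gets its own int[] scratch buffer; no two threads
    // share (or falsely share) the same backing memory.
    static final ThreadLocal<int[]> BUFFER =
            ThreadLocal.withInitial(() -> new int[1024]);

    /** Fills and sums the calling thread's private buffer (illustrative work). */
    static long sumInto(int n) {
        int[] buf = BUFFER.get(); // this thread's private copy
        for (int i = 0; i < n; i++) buf[i] = i;
        long sum = 0;
        for (int i = 0; i < n; i++) sum += buf[i];
        return sum;
    }

    public static void main(String[] args) throws InterruptedException {
        // Two threads work on completely separate buffers, no contention.
        Thread t1 = new Thread(() -> System.out.println(sumInto(100))); // 4950
        Thread t2 = new Thread(() -> System.out.println(sumInto(100))); // 4950
        t1.start(); t2.start();
        t1.join(); t2.join();
    }
}
```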
The best piece of advice I can give you is test, test, test. Don't make assumptions about how CPUs will perform - there is a huge amount of magic going on in CPUs these days, often with counterintuitive results. Note also that the JIT runtime optimization will add an additional layer of complexity here (maybe good, maybe not).
On the one hand, you'd like to think Threads == CPU/Cores makes perfect sense. Why have a thread if there's nothing to run it?
The detail boils down to "what are the threads doing". A thread that's idle waiting for a network packet or a disk block is CPU time wasted.
If your threads are CPU-heavy, then a 1:1 correlation makes some sense. If you have a single "read the DB" thread that feeds the other threads, and a single "dump the data" thread that pulls data from the CPU threads and creates output, those two could most likely share a CPU easily while the CPU-heavy threads keep churning away.
The real answer, as with all sorts of things, is to measure it. Since the number is configurable (apparently), configure it! Run it with 1:1 threads to CPUs, 2:1, 1.5:1, whatever, and time the results. Fast one wins.
The number that your application needs; no more, and no less.
Obviously, if you're writing an application which contains some parallelisable algorithm, then you can probably start benchmarking to find a good balance in the number of threads, but bear in mind that hundreds of threads won't speed up any operation.
If your algorithm can't be parallelised, then no number of additional threads is going to help.
Yes, that's a perfectly reasonable approach. One thread per processor/core will maximize processing power and minimize context switching. I'd probably leave that as-is unless I found a problem via benchmarking/profiling.
One thing to note is that the JVM does not guarantee availableProcessors() will be constant, so technically, you should check it immediately before spawning your threads. I doubt that this value is likely to change at runtime on typical computers, though.
P.S. As others have pointed out, if your process is not CPU-bound, this approach is unlikely to be optimal. Since you say these threads are being used to generate images, though, I assume you are CPU bound.
The number of processors is a good start; but if those threads do a lot of I/O, then you might be better off with more... or fewer.
First think of what resources are available and what you want to optimise (least time to finish, least impact on other tasks, etc.), then do the math.
Sometimes it can be better to dedicate a thread or two to each I/O resource and let the others fight for the CPU; the analysis is usually easier on these designs.
The benefit of using threads is to reduce the wall-clock execution time of your program by allowing it to work on one part of the job while another part is waiting for something to happen (usually I/O). If your program is totally CPU-bound, adding threads beyond the number of processors will only slow it down. If it is fully or partially I/O-bound, adding threads may help, but there's a balance to be struck between the overhead of additional threads and the additional work that gets accomplished. Making the number of threads equal to the number of processors will yield peak performance if the program is totally, or near-totally, CPU-bound.
As with many questions with the word "should" in them, the answer is, "It depends". If you think you can get better performance, adjust the number of threads up or down and benchmark the application's performance. Also take into account any other factors that might influence the decision (if your application is eating 100% of the computer's available horsepower, the performance of other applications will be reduced).
This assumes that the multi-threaded code is written properly etc. If the original developer only had one CPU, he would never have had a chance to experience problems with poorly-written threading code. So you should probably test behaviour as well as performance when adjusting the number of threads.
By the way, you might want to consider allowing the number of threads to be configured at run time instead of compile time to make this whole process easier.
After seeing your edit, it's quite possible that one thread per CPU is as good as it gets. Your application seems quite parallelizable. If you have extra hardware you can use GridGain to grid-enable your app and have it run on multiple machines. That's probably about the only thing, beyond buying faster / more cores, that will speed it up.