Related
Goal: Execute certain code every once in a while.
Question: In terms of performance, is there a significant difference between:
while(true) {
execute();
Thread.sleep(10 * 1000);
}
and
executor.scheduleWithFixedDelay(runnableWithoutSleep, 0, 10, TimeUnit.SECONDS);
?
Of course, the latter option is more kosher. Yet, I would like to know whether I should embark on an adventure called "Spend a few days refactoring legacy code to say goodbye to Thread.sleep()".
Update:
This code runs in super/mega/hyper high-load environment.
You're dealing with sleep times termed in tens of seconds. The possible savings by changing your sleep option here is likely nanoseconds or microseconds.
I'd prefer the latter style every time, but if you have the former and it's going to cost you a lot to change it, "improving performance" isn't a particularly good justification.
EDIT re: 8000 threads
8000 threads is an awful lot; I might move to the scheduled executor just so that you can control the amount of load put on your system. Your point about varying wakeup times is something to be aware of, although I would argue that the bigger risk is a stampede of threads all sleeping and then waking in close succession and competing for all the system resources.
I would spend the time to throw these all in a fixed thread pool scheduled executor. Only have as many running concurrently as you have available of the most limited resource (for example, # cores, or # IO paths) plus a few to pick up any slop. This will give you good throughput at the expense of latency.
With the Thread.sleep() method it will be very hard to control what is going on, and you will likely lose out on both throughput and latency.
If you need more detailed advice, you'll probably have to describe what you're trying to do in more detail.
Since you haven't mentioned the Java version, so, things might change.
As I recall from the source code of Java, the prime difference that comes is the way things are written internally.
For Sun Java 1.6 if you use the second approach the native code also brings in the wait and notify calls to the system. So, in a way more thread efficient and CPU friendly.
But then again you loose the control and it becomes more unpredictable for your code - consider you want to sleep for 10 seconds.
So, if you want more predictability - surely you can go with option 1.
Also, on a side note, in the legacy systems when you encounter things like this - 80% chances there are now better ways of doing it- but the magic numbers are there for a reason(the rest 20%) so, change it at own risk :)
There are different scenarios,
The Timer creates a queue of tasks that is continually updated. When the Timer is done, it may not be garbage collected immediately. So creating more Timers only adds more objects onto the heap. Thread.sleep() only pauses the thread, so memory overhead would be extremely low
Timer/TimerTask also takes into account the execution time of your task, so it will be a bit more accurate. And it deals better with multithreading issues (such as avoiding deadlocks etc.).
If you thread get exception and gets killed, that is a problem. But TimerTask will take care of it. It will run irrespective of failure in previous run
The advantage of TimerTask is that it expresses your intention much better (i.e. code readability), and it already has the cancel() feature implemented.
Reference is taken from here
You said you are running in a "mega... high-load environment" so if I understand you correctly you have many such threads simultaneously sleeping like your code example. It takes less CPU time to reuse a thread than to kill and create a new one, and the refactoring may allow you to reuse threads.
You can create a thread pool by using a ScheduledThreadPoolExecutor with a corePoolSize greater than 1. Then when you call scheduleWithFixedDelay on that thread pool, if a thread is available it will be reused.
This change may reduce CPU utilization as threads are being reused rather than destroyed and created, but the degree of reduction will depend on the tasks they're doing, the number of threads in the pool, etc. Memory usage will also go down if some of the tasks overlap since there will be less threads sitting idle at once.
QUESTION
How do I scale to use more threads if and only if there is free cpu?
Something like a ThreadPoolExecutor that uses more threads when cpu cores are idle, and less or just one if not.
USE CASE
Current situation:
My Java server app processes requests and serves results.
There is a ThreadPoolExecutor to serve the requests with a reasonable number of max threads following the principle: number of cpu cores = number of max threads.
The work performed is cpu heavy, and there's some disk IO (DBs).
The code is linear, single threaded.
A single request takes between 50 and 500 ms to process.
Sometimes there are just a few requests per minute, and other times there are 30 simultaneous.
A modern server with 12 cores handles the load nicely.
The throughput is good, the latency is ok.
Desired improvement:
When there is a low number of requests, as is the case most of the time, many cpu cores are idle.
Latency could be improved in this case by running some of the code for a single request multi-threaded.
Some prototyping shows improvements, but as soon as I test with a higher number of concurrent requests,
the server goes bananas. Throughput goes down, memory consumption goes overboard.
30 simultaneous requests sharing a queue of 10 meaning that 10 can run at most while 20 are waiting,
and each of the 10 uses up to 8 threads at once for parallelism, seems to be too much for a machine
with 12 cores (out of which 6 are virtual).
This seems to me like a common use case, yet I could not find information by searching.
IDEAS
1) request counting
One idea is to count the current number of processed requests. If 1 or low then do more parallelism,
if high then don't do any and continue single-threaded as before.
This sounds simple to implement. Drawbacks are: request counter resetting must not contain bugs,
think finally. And it does not actually check available cpu, maybe another process uses cpu also.
In my case the machine is dedicated to just this application, but still.
2) actual cpu querying
I'd think that the correct approach would be to just ask the cpu, and then decide.
Since Java7 there is OperatingSystemMXBean.getSystemCpuLoad() see http://docs.oracle.com/javase/7/docs/jre/api/management/extension/com/sun/management/OperatingSystemMXBean.html#getSystemCpuLoad()
but I can't find any webpage that mentions getSystemCpuLoad and ThreadPoolExecutor, or a similar
combination of keywords, which tells me that's not a good path to go.
The JavaDoc says "Returns the "recent cpu usage" for the whole system", and I'm wondering what
"recent cpu usage" means, how recent that is, and how expensive that call is.
UPDATE
I had left this question open for a while to see if more input is coming. Nope. Although I don't like the "no-can-do" answer to technical questions, I'm going to accept Holger's answer now. He has good reputation, good arguments, and others have approved his answer.
Myself I had experimented with idea 2 a bit. I queried the getSystemCpuLoad() in tasks to decide how large their own ExecutorService could be. As Holger wrote, when there is a SINGLE ExecutorService, resources can be managed well. But as soon as tasks start their own tasks, they cannot - it didn't work out for me.
There is no way of limiting based on “free CPU” and it wouldn’t work anyway. The information about “free CPU” is outdated as soon as you get it. Suppose you have twelve threads running concurrently and detecting at the same time that there is one free CPU core and decide to schedule a sub-task…
What you can do is limiting the maximum resource consumption which works quite well when using a single ExecutorService with a maximum number of threads for all tasks.
The tricky part is the dependency of the tasks on the result of the sub-tasks which are enqueued at a later time and might still be pending due to the the limited number of worker threads.
This can be adjusted by revoking the parallel execution if the task detects that its sub-task is still pending. For this to work, create a FutureTask for the sub-task manually and schedule it with execute rather than submit. Then proceed within the task as normally and at the place where you would perform the sub-task in a sequential implementation check whether you can remove the FutureTask from the ThreadPoolExecutor. Unlike cancel this works only if it has not started yet and hence is an indicator that there are no free threads. So if remove returns true you can perform the sub-task in-place letting all other threads perform tasks rather than sub-tasks. Otherwise, you can wait for the result.
At this place it’s worth noting that it is ok to have more threads than CPU cores if the tasks accommodate I/O operations (or may wait for sub-tasks). The important point here is to have a limit.
FutureTask<Integer> coWorker = new FutureTask<>(/* callable wrapping sub-task*/);
executor.execute(coWorker);
// proceed in the task’s sequence
if(executor.remove(coWorker)) coWorker.run();// do in-place if needed
subTaskResult=coWorker.get();
// proceed
It sounds like the ForkJoinPool introduced in Java 7 would be exactly what you need. The ForkJoinPool is specifically designed to keep all your CPUs exactly busy meaning that there are as many threads as there are CPUs and that all those threads are also working and not blocking (For the later make sure that you use ManagedBlockers for DB queries).
In a ForkJoinTask there is the method getSurplusQueuedTaskCount for which the JavaDoc says "This value may be useful for heuristic decisions about whether to fork other tasks." and as such serves as a better replacement for your getSystemCpuLoad solution to make decisions about task decompositions. This allows you to reduce the number of decompositions when system load is high and thus reduce the impact of the task decomposition overhead.
Also see my answer here for some more indepth explanation about the principles of Fork/Join-pools.
I'm developing a small multi-thread application (in java) to help me understand it. As I researched about it, I learned that the ideal amount of threads you would like the number supported by the processor (ie. 4 in an Intel i3, 8 in an Intel i7, I think). But swing alone already has 3 threads + 1 thread (the main, in this case). Does that means that I won't have any significant improvement in a processor which supports 4 threads? Will the swing threads just consume all the processor threads and everything else will just run on the same processor? Is it worthed to multi-thread it (performance-wise) even with those swing threads?
OBS: A maybe important observation that needs to be made is that I will be using a JFrame and doing active-rendering. That's probably as far as I will go with swing.
I learned that the ideal amount of threads you would like the number supported by the processor
That statement is only true if your Threads are occupying the whole CPU. For example the Swing thread (Event Dispatch Thread) is most of the time just waiting for user input.
The number one thing that most threads do is waiting. They are just there so they are ready to go at the instant the system needs their services.
The comment about the ideal thread count goes for the number of threads at 100% workload.
Swing's threads spend a lot of time being idle. The ideal number of threads is wrt threads executing at or near 100% processor time.
You may still not see significant improvements due to other factors, but the threads inherent in swing shouldn't be a concern.
The ideal # of threads is not necessarily governed by the # of cpu's (cores) you have. There's a lot of tuning involved base on the actual code you are executing.
For example, let's take a Runnable which executions some database queries (doesn't matter what). Most of the thread time will be spent blocked waiting for a response from the database. So if you have 4 cores, and execute 4 threads. The odds are that at any given time, many of them are blocked on db calls. You can easily spawn more threads with no ill effects on your cpu. In this case you're limited not by the specs of your machine, but by the degree of concurrency which the db will handle.
Another example would be file I/O, which spends most of it's time waiting for the I/O subsystem to respond with data.
The only real way is evaluation of your multi-threaded code, along with trial and error for a given environment.
Yes, they do but only minimally. There are other threads such as GC and finalizer threads as well that are running in the background. These are all necessary for the JVM to operate just as the Swing threads are necessary for Swing to work.
You shouldn't have to worry about them unless you are on a dramatically small system with little resources or CPU capacity.
With modern systems, many of which have multiple processors and/or multiple cores, the JVM and the OS will run these other threads on other processors and still give your user threads all of the processor power that you will need.
Also, most of the background Swing threads are in wait loops waiting to handle events and make display changes. Unless you do something wrong, they should make up a small amount of your application processor requirements.
I learned that the ideal amount of threads you would like the number supported by the processor
As #Robin mentioned, this is only necessary when you are trying to optimize a program that has a number of CPU-bound operations. For example, our application typically has 1000s of threads but 8 processors and still is very responsive since the threads are all waiting for IO or events. The only time you need to worry about the number of CPUs is when you are doing processor intensive operations and you are trying to maximize your throughput.
I'm toying with the idea of writing a physics simulation software in which each physical element would be simulated in its own thread.
There would be several advantages to this approach. It would be conceptually very close to how the real world works. It would be much easier to scale the system to multiple machines.
However, for this to work I need to make sure that all threads run at the same speed, with a rather liberal interpretation of 'same'. Say within 1% of each others.
That's why I don't necessarily need a Thread.join() like solution. I don't want some uber-controlling school mistress that ensures all threads regularly synchronize with each others. I just need to be able to ask the runtime (whichever it is---could be Java, Erlang, or whatever is most appropriate for this problem) to run the threads at a more or less equal speed.
Any suggestions would be extremely appreciated.
UPDATE 2009-03-16
I wanted to thank everyone who answered this question, in particular all those whose answer was essentially "DON'T DO THIS". I understand my problem much better now thanks to everybody's comments and I am less sure I should continue as I originally planned. Nevertheless I felt that Peter's answer was the best answer to the question itself, which is why I accepted it.
You can't really do this without coordination. What if one element ended up needing cheaper calculations than another (in a potentially non-obvious way)?
You don't necessarily need an uber-controller - you could just keep some sort of step counter per thread, and have a global counter indicating the "slowest" thread. (When each thread has done some work, it would have to check whether it had fallen behind the others, and update the counter if so.) If a thread notices it's a long way ahead of the slowest thread, it could just wait briefly (potentially on a monitor).
Just do this every so often to avoid having too much overhead due to shared data contention and I think it could work reasonably well.
You'll need some kind of synchronization. CyclicBarrier class has what you need:
A synchronization aid that allows a
set of threads to all wait for each
other to reach a common barrier point.
CyclicBarriers are useful in programs
involving a fixed sized party of
threads that must occasionally wait
for each other. The barrier is called
cyclic because it can be re-used after
the waiting threads are released.
After each 'tick', you can let all your threads to wait for others, which were slower. When remaining threads reach the barrier, they all will continue.
Threads are meant to run completely independent of each other, which means synchronizing them in any way is always a pain. In your case, you need a central "clock" because there is no way to tell the VM that each thread should get the same amount of ... uh ... what should it get? The same amount of RAM? Probably doesn't matter. The same amount of CPU? Are all your objects so similar that each needs the same number of assembler instructions?
So my suggestion is to use a central clock which broadcasts clock ticks to every process. All threads within each process read the ticks (which should be absolute), calculate the difference to the last tick they saw and then update their internal model accordingly.
When a thread is done updating, it must put itself to sleep; waiting for the next tick. In Java, use wait() on the "tick received" lock and wake all threads with "notifyAll()".
I'd recommend not using threads wherever possible because they just add problems later if you're not careful. When doing physics simulations you could use hundreds of thousands of discrete objects for larger simulations. You can't possibly create this many threads on any OS that I know of, and even if you could it would perform like shit!
In your case you could create a number of threads, and put an event loop in each thread. A 'master' thread could sequence the execution and post a 'process' event to each worker thread to wake it up and make it do some work. In that way the threads will sleep until you tell them to work.
You should be able to get the master thread to tick at a rate that allows all your worker threads to complete before the next tick.
I don't think threads are the answer to your problem, with the exception of parallelising into a small number of worker threads (equal to the number of cores in the machine) which each linearly sequence a series of physical objects. You could still use the master/event-driven approach this way, but you would remove a lot of the overhead.
Please don't. Threads are an O/S abstraction permitting the appearance of parallel execution. With multiple and multicore CPU's, the O/S can (but need not) distribute threads among the different cores.
The closest thing to your scalability vision which I see as workable is to use worker threads, dimensioned to roughly match the number of cores you have, and distribute work among them. A rough draft: define a class ActionTick which does the updating for one particle, and let the worker thread pick ActionTicks to process from a shared queue. I see several challenges even with such a solution.
Threading overheads: you get context switching overhead among different worker threads. Threads by themselves are expensive (if not actually as ruinous as processes): test performance with different thread pool sizes. Adding more threads beyond the number of cores tends to reduce performance!
Synchronization costs: you get several spots of contention: access to the work queue for one, but worse, access to the simulated world. You need to delimit the effects of each ActionTick or implement a lot of locking/unlocking.
Difficulty of optimizating the physics. You want to delimit the number of objects/particles each ActionTick looks at (distance cut-off? 3D-tree-subdivision of the simulation space?). Depending on the simulation domain, you may be able to eliminate a lot of work by examining whether any changes is even needed in a subset of items. Doing these kinds of optimizations is easier before queueing work items, rather than as a distributed algorithm. But then that part of your simulation becomes a potential scalability bottleneck.
Complexity. Threading and concurrency introduces several cans of worms to a solution. Always consider other options first -- but if you need them, try threads before creating your own work item scheduling, locking and execution strategies...
Caveat: I haven't worked with any massive simulation software, just some hobbyist code.
As you mention, there are many "DON'T DO THIS" answers. Most seem to read threads as OS threads used by Java. Since you mentioned Erlang in your post, I'd like to post a more Erlang-centered answer.
Modeling this kind of simulation with processes (or actors, micro threads, green threads, as they are sometimes called) doesn't necessarily need any synchronization. In essence, we have a couple of (most likely thousands or hundreds of thousands) physics objects that need to be simulated. We want to simulate these objects as realistically as possible, but there is probably also some kind of real time aspect involved (doesn't have to be though, you don't mention this in your question).
A simple solution would be to spawn of an Erlang process for each object, sent ticks to all of them and collect the results of the simulation before proceeding with the next tick. This is in practice synchronizing everything. It is of course more of a deterministic solution and does not guarantee any real time properties. It is also non-trivial how the processes would talk to each other to get the data they need for the calculations. You probably need to group them in clever ways (collision groups etc), have hibernated processes (which Erlang has neat support for) for sleeping objects, etc to speed things up.
To get real time properties you probably need to restrain the calculations performed by the processes (trading accuracy for speed). This could perhaps be done by sending out ticks without waiting for answers, and letting the object processes reply back to each tick with their current position and other data you need (even though it might only be approximated at the time). As DJClayworth says, this could lead to errors accumulating in the simulation.
I guess in one sense, the question is really about if it is possible to use the strength of concurrency to gain some kind of advantage here. If you need synchronization, it is a quite strong sign that you do not need concurrency between each physics object. Because you essentially throw away a lot of computation time by waiting for other processes. You might use concurrency during calculation but that is another discussion, I think.
Note: none of these ideas take the actual physics calculations into account. This is not Erlang strong side and could perhaps be performed in a C library or whatever strikes your fancy, depending on the type of characteristics you want.
Note: I do not know of any case where this has been done (especially not by me), so I cannot guarantee that this is sound advice.
Even with perfect software, hardware will prevent you doing this. Hardware threads typically don't have fair performance. Over a short period, you are lucky if threads run within +-10% performance.
The are, of course, outliers. Some chipsets will run some cores in powersaving mode and others not. I believe one of the Blue Gene research machines had software controlled scheduling of hardware threads instead of locks.
Erlang will by default try and spread its processes evenly over the available threads. It will also by default try to run threads on all available processors. So if you have enough runnable Erlang processes then you will get a relatively even balance.
I'm not a threading expert, but isn't the whole point of threads that they are independent from each other - and non-deterministic?
I think you have a fundamental misconception in your question where you say:
It would be conceptually very close to how the real world works
The real world does not work in a thread-like way at all. Threads in most machines are not independent and not actually even simultaneous (the OS will use context-switching instead). They provide the most value when there is a lot of IO or waiting occurring.
Most importantly, the real-world does not "consume more resources" as more complex things happen. Think of the difference between two objects falling from a height, one falling smoothly and the other performing some kind of complex tumbling motion...
I would make a kind of "clock generator" - and would register every new object/thread there. The clock will notify all registered objects when the delta-t has passed.
However this does not mean you need a separate thread for every object. Ideally you will have as many threads as processors.
From a design point of you could separate the execution of the object-tasks through an Executor or a thread-pool, e.g. when an object receives the tick event, it goes to a thread pool and schedules itself for execution.
Two things has to happen in order to achieve this. You have to assure thah you have equal number of threads per CPU core, and you need some kind of synchronization.
That sync can be rather simple, like checking "cycle-done" variable for each thread while performing computation, but you can't avoid it.
Working at control for motors i have used some math to maintain velocity at stable state.
The system have PID control, proportional, integral and derivative. But this is analog/digital system. Maybe can use similarly to determine how mush time each thread must run, but the biggest tip I can give you is that all threads will each have a clock synchronization.
I'm first to admit I'm not a threading expert, but this sounds like a very wrong way to approach simulation. As others have already commented having too many threads is computationally expensive. Furthermore, if you are planing to do what I think you are thinking of doing, your simulation may turn out to produce random results (may not matter if you are making a game).
I'd go with a few worker threads used to calculate discrete steps of the simulation.
I recently inherited a small Java program that takes information from a large database, does some processing and produces a detailed image regarding the information. The original author wrote the code using a single thread, then later modified it to allow it to use multiple threads.
In the code he defines a constant;
// number of threads
public static final int THREADS = Runtime.getRuntime().availableProcessors();
Which then sets the number of threads that are used to create the image.
I understand his reasoning that the number of threads cannot be greater than the number of available processors, so set it the the amount to get the full potential out of the processor(s). Is this correct? or is there a better way to utilize the full potential of the processor(s)?
EDIT: To give some more clarification, The specific algorithm that is being threaded scales to the resolution of the picture being created, (1 thread per pixel). That is obviously not the best solution though. The work that this algorithm does is what takes all the time, and is wholly mathematical operations, there are no locks or other factors that will cause any given thread to sleep. I just want to maximize the programs CPU utilization to decrease the time to completion.
Threads are fine, but as others have noted, you have to be highly aware of your bottlenecks. Your algorithm sounds like it would be susceptible to cache contention between multiple CPUs - this is particularly nasty because it has the potential to hit the performance of all of your threads (normally you think of using multiple threads to continue processing while waiting for slow or high latency IO operations).
Cache contention is a very important aspect of using multi CPUs to process a highly parallelized algorithm: Make sure that you take your memory utilization into account. If you can construct your data objects so each thread has it's own memory that it is working on, you can greatly reduce cache contention between the CPUs. For example, it may be easier to have a big array of ints and have different threads working on different parts of that array - but in Java, the bounds checks on that array are going to be trying to access the same address in memory, which can cause a given CPU to have to reload data from L2 or L3 cache.
Splitting the data into it's own data structures, and configure those data structures so they are thread local (might even be more optimal to use ThreadLocal - that actually uses constructs in the OS that provide guarantees that the CPU can use to optimize cache.
The best piece of advice I can give you is test, test, test. Don't make assumptions about how CPUs will perform - there is a huge amount of magic going on in CPUs these days, often with counterintuitive results. Note also that the JIT runtime optimization will add an additional layer of complexity here (maybe good, maybe not).
On the one hand, you'd like to think Threads == CPU/Cores makes perfect sense. Why have a thread if there's nothing to run it?
The detail boils down to "what are the threads doing". A thread that's idle waiting for a network packet or a disk block is CPU time wasted.
If your threads are CPU heavy, then a 1:1 correlation makes some sense. If you have a single "read the DB" thread that feeds the other threads, and a single "Dump the data" thread and pulls data from the CPU threads and create output, those two could most likely easily share a CPU while the CPU heavy threads keep churning away.
The real answer, as with all sorts of things, is to measure it. Since the number is configurable (apparently), configure it! Run it with 1:1 threads to CPUs, 2:1, 1.5:1, whatever, and time the results. Fast one wins.
The number that your application needs; no more, and no less.
Obviously, if you're writing an application which contains some parallelisable algorithm, then you can probably start benchmarking to find a good balance in the number of threads, but bear in mind that hundreds of threads won't speed up any operation.
If your algorithm can't be parallelised, then no number of additional threads is going to help.
Yes, that's a perfectly reasonable approach. One thread per processor/core will maximize processing power and minimize context switching. I'd probably leave that as-is unless I found a problem via benchmarking/profiling.
One thing to note is that the JVM does not guarantee availableProcessors() will be constant, so technically, you should check it immediately before spawning your threads. I doubt that this value is likely to change at runtime on typical computers, though.
P.S. As others have pointed out, if your process is not CPU-bound, this approach is unlikely to be optimal. Since you say these threads are being used to generate images, though, I assume you are CPU bound.
number of processors is a good start; but if those threads do a lot of i/o, then might be better with more... or less.
first think of what are the resources available and what do you want to optimise (least time to finish, least impact to other tasks, etc). then do the math.
sometimes it could be better if you dedicate a thread or two to each i/o resource, and the others fight for CPU. the analisys is usually easier on these designs.
The benefit of using threads is to reduce wall-clock execution time of your program by allowing your program to work on a different part of the job while another part is waiting for something to happen (usually I/O). If your program is totally CPU bound adding threads will only slow it down. If it is fully or partially I/O bound, adding threads may help but there's a balance point to be struck between the overhead of adding threads and the additional work that will get accomplished. To make the number of threads equal to the number of processors will yield peak performance if the program is totally, or near-totally CPU-bound.
As with many questions with the word "should" in them, the answer is, "It depends". If you think you can get better performance, adjust the number of threads up or down and benchmark the application's performance. Also take into account any other factors that might influence the decision (if your application is eating 100% of the computer's available horsepower, the performance of other applications will be reduced).
This assumes that the multi-threaded code is written properly etc. If the original developer only had one CPU, he would never have had a chance to experience problems with poorly-written threading code. So you should probably test behaviour as well as performance when adjusting the number of threads.
By the way, you might want to consider allowing the number of threads to be configured at run time instead of compile time to make this whole process easier.
After seeing your edit, it's quite possible that one thread per CPU is as good as it gets. Your application seems quite parallelizable. If you have extra hardware you can use GridGain to grid-enable your app and have it run on multiple machines. That's probably about the only thing, beyond buying faster / more cores, that will speed it up.