I would like to ask whether Java uses more CPU resources when threads are blocked, i.e. waiting to lock a monitor that is currently held by another thread.
I am looking at a thread dump in which some threads are blocked waiting to lock a monitor, and I am unsure whether that could account for the high CPU usage.
Thanks!
EDIT (6 May 2011): I forgot to ask whether this behavior is relevant for Java SE 1.4.2.
Threads consume resources such as memory. Blocking/unblocking a thread incurs a one-off cost. If a thread blocks/unblocks tens of thousands of times per second, this can waste a significant amount of CPU.
However, once a thread is blocked, it doesn't matter how long it stays blocked; there is no ongoing cost.
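To make that concrete, here is a minimal demo (a hypothetical sketch, not from the original question): two threads hand an item back and forth through a SynchronousQueue, so every single iteration blocks and unblocks a thread. The cost is per handoff; a thread that simply stays blocked adds nothing.

    import java.util.concurrent.SynchronousQueue;

    public class HandoffDemo {
        public static void main(String[] args) throws InterruptedException {
            SynchronousQueue<Integer> q = new SynchronousQueue<>();
            Thread producer = new Thread(() -> {
                try {
                    for (int i = 0; i < 1_000_000; i++) q.put(i); // blocks until taken
                } catch (InterruptedException ignored) { }
            });
            producer.start();
            long start = System.nanoTime();
            for (int i = 0; i < 1_000_000; i++) q.take(); // each take unblocks the producer
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            System.out.println("1,000,000 handoffs in " + elapsedMs + " ms");
            producer.join();
        }
    }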
The answer is not so simple. There are cases where threads that go into the blocked state can still end up causing CPU utilization.
Most JVMs employ tiered locking algorithms. These often involve techniques such as spinlocks, especially for locks held for a short duration. When a thread tries to acquire a monitor and finds it cannot, the JVM may actually put it in a loop and have the thread repeatedly attempt to acquire the monitor, rather than context switching it out immediately. If the thread fails to acquire the lock after a certain number of tries or a certain duration (depending on the specific JVM implementation), the JVM switches to a "fat lock" or "inflated lock" mode where it does context switch the thread out.
It is the spinlock behavior that may incur CPU costs. If you have code that holds a lock for a very short duration and contention is high, you may see an appreciable bump in CPU utilization. For some discussion of the techniques JVMs use to reduce the cost of contention, see http://www.ibm.com/developerworks/java/library/j-jtp10185/index.html.
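As a hedged illustration (this says nothing about any particular JVM's internals), the following toy program has exactly that shape: a monitor held just long enough to increment a counter, hammered by 8 threads. Under contention like this, spinning can show up as CPU time well beyond the useful work done:

    public class ShortLockContention {
        private static final Object lock = new Object();
        private static long counter = 0;

        public static void main(String[] args) throws InterruptedException {
            Thread[] threads = new Thread[8];
            for (int t = 0; t < threads.length; t++) {
                threads[t] = new Thread(() -> {
                    for (int i = 0; i < 5_000_000; i++) {
                        synchronized (lock) { // held only long enough to increment
                            counter++;
                        }
                    }
                });
                threads[t].start();
            }
            for (Thread t : threads) t.join();
            System.out.println(counter); // 40,000,000
        }
    }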
No, threads that are blocked on a monitor do not take up additional CPU time.
Suspended or blocked threads do not consume any CPU time.
AFAIK, every object in Java has a header whose first word (the mark word) is used for storing locking information: either a flag when only one thread is acquiring the lock, or a pointer to a monitor object when there is contention between different threads. In both cases, a compare-and-swap (CAS) construct is used to acquire the lock.
But according to this link:
https://www.baeldung.com/lmax-disruptor-concurrency
To deal with the write contention, a queue often uses locks, which can cause a context switch to the kernel. When this happens the processor involved is likely to lose the data in its caches.
What am I missing?
Neither synchronized nor the standard Lock implementations require a context switch into the kernel for uncontended locking or for unlocking. These operations indeed boil down to an atomic CAS or a plain write.
The performance-critical aspect is contention, i.e. trying to acquire the monitor or lock when it's not available. Waiting for the availability of the monitor or lock implies putting the thread into a waiting state and reactivating it when the resource becomes available. The performance impact of this is so large that you don't need to worry about CPU caches at all.
For this reason, typical implementations perform some amount of spinning, rechecking the availability of the monitor or lock in a loop for some time, when there is a chance of it becoming available in that time. This is usually tied to the number of CPU cores. When the resource becomes available in that time, these costs can be avoided. This, however, usually requires the acquisition to be allowed to be unfair, as a spinning acquisition may overtake an already waiting thread.
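To give the idea some shape, here is a rough sketch of "spin a bit, then block". This is emphatically not HotSpot's actual algorithm; the spin count, wait queue, and (lack of) fairness policy are all made up for illustration:

    import java.util.concurrent.ConcurrentLinkedQueue;
    import java.util.concurrent.atomic.AtomicBoolean;
    import java.util.concurrent.locks.LockSupport;

    class SpinThenBlockLock {
        private final AtomicBoolean held = new AtomicBoolean(false);
        private final ConcurrentLinkedQueue<Thread> waiters = new ConcurrentLinkedQueue<>();

        void lock() {
            // Uncontended acquisition is a single CAS, no kernel involvement.
            int spins = Runtime.getRuntime().availableProcessors() > 1 ? 1_000 : 0;
            for (int i = 0; i < spins; i++) {
                if (held.compareAndSet(false, true)) return;
            }
            // Spinning failed: enqueue and park until unlocked (unfair: a newly
            // spinning thread may overtake us, as described above).
            waiters.add(Thread.currentThread());
            while (!held.compareAndSet(false, true)) {
                LockSupport.park(this);
            }
            waiters.remove(Thread.currentThread());
        }

        void unlock() {
            held.set(false); // releasing is a plain write
            Thread next = waiters.peek();
            if (next != null) LockSupport.unpark(next);
        }
    }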
Note that, just before your cited sentence, the linked article says:
Queues are typically always close to full or close to empty due to the differences in pace between consumers and producers.
In such a scenario, the faster threads will sooner or later enter a condition wait, waiting for new space or new items in the queue, even if they acquired the lock without contention. So in this specific scenario the associated costs are indeed there, and unavoidable with a simple queue implementation.
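You can see this with any bounded queue. In this small demo the producer outpaces the consumer, so the queue hovers near full and the producer spends nearly all of its life in condition waits inside put():

    import java.util.concurrent.ArrayBlockingQueue;

    public class FastProducerSlowConsumer {
        public static void main(String[] args) throws InterruptedException {
            ArrayBlockingQueue<Integer> q = new ArrayBlockingQueue<>(16);
            Thread producer = new Thread(() -> {
                try {
                    for (int i = 0; i < 100; i++) q.put(i); // blocks whenever the queue is full
                } catch (InterruptedException ignored) { }
            });
            producer.start();
            for (int i = 0; i < 100; i++) {
                Thread.sleep(10); // slow consumer keeps the queue near full
                q.take();         // each take wakes the waiting producer
            }
            producer.join();
        }
    }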
The spec for this method: https://docs.oracle.com/javase/7/docs/api/java/util/concurrent/Executors.html#newCachedThreadPool()
Creates a thread pool that creates new threads as needed, but will reuse previously constructed threads when they are available. These pools will typically improve the performance of programs that execute many short-lived asynchronous tasks. Calls to execute will reuse previously constructed threads if available. If no existing thread is available, a new thread will be created and added to the pool. Threads that have not been used for sixty seconds are terminated and removed from the cache. Thus, a pool that remains idle for long enough will not consume any resources. Note that pools with similar properties but different details (for example, timeout parameters) may be created using ThreadPoolExecutor constructors.
It's not clear to me from this description - is it safe to have several of these pools in a single program? Or would I potentially run into a situation where one pool stalls on many threads and freezes up other pools?
I don't think there is a clear yes / no answer on this.
On the one hand, there is no fixed number of threads that ThreadPoolExecutor instances consume; the JVM architecture itself doesn't limit the number of threads.
On the second hand, the OS / environment may place some limits:
The OS may have hard limits on the total number of native threads it will support.
The OS may restrict the number of native threads that a given process (in this case the JVM) can create. This could be done using ulimit or cgroup limits, and potentially other ways.
A Java thread stack has a size of 1MB (by default) on a typical 64-bit JVM. If you attempt to start() too many threads, you may run out of memory and get an OOME.
If there is a large enough number of threads and/or too much thread context switching, the thread scheduler (in the OS) may struggle.
(Context switching typically happens when a thread makes a blocking syscall or has to wait on a lock or a notification. Each time you switch context there are hardware-related overheads: saving and restoring registers, switching virtual memory contexts, flushing memory caches, etc.)
On the third hand, there are things besides the number and size of thread pools that could cause problems. For example, if the thread tasks interact with each other, you could experience problems due to:
deadlocking when locking shared objects,
too much contention on shared locks leading to resource starvation,
too much work leading to timeouts, or
priority inversion problems ... if you try to use priorities to "manage" the workload.
So ...
Is it safe to have several of these pools in a single program?
Or would I potentially run into a situation where one pool stalls on many threads and freezes up other pools?
It is unlikely you would get a "stall" ... unless the tasks are interacting in some way.
But if you have too many runnable threads competing for CPU, each one will get (on average) a smaller share of the finite number of cores available. And lock contention or too much context switching can slow things down further.
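If you want to limit the blast radius of any one pool, the javadoc's own hint about the ThreadPoolExecutor constructors applies. Here is a sketch of a cached-style pool with a cap; the cap of 16 and the saturation policy are arbitrary choices for illustration, not recommendations:

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.SynchronousQueue;
    import java.util.concurrent.ThreadPoolExecutor;
    import java.util.concurrent.TimeUnit;

    public class BoundedCachedPool {
        public static void main(String[] args) throws InterruptedException {
            // Same shape as Executors.newCachedThreadPool(), but capped at 16 threads
            // so one busy pool cannot exhaust the process's thread / memory limits.
            ExecutorService pool = new ThreadPoolExecutor(
                    0, 16,
                    60L, TimeUnit.SECONDS,
                    new SynchronousQueue<>(),
                    new ThreadPoolExecutor.CallerRunsPolicy()); // run in caller when saturated
            for (int i = 0; i < 100; i++) {
                final int n = i;
                pool.execute(() -> System.out.println("task " + n + " on "
                        + Thread.currentThread().getName()));
            }
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.MINUTES);
        }
    }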
My Problem:
Do large numbers of threads in the JVM consume a lot of resources (memory, CPU) when the threads are in the TIMED_WAITING state (not sleeping) >99.9% of the time? When the threads are waiting, how much CPU overhead does it cost to maintain them, if any is needed at all?
Does the answer also apply to non-JVM environments (like the Linux kernel)?
Context:
My program receives a large number of space-consuming packages. It stores counts of similar attributes across the different packages. A given period of time after receiving a package (could be hours or days), that specific package expires, and any count the package contributed to should be decremented.
Currently, I achieve this by storing all the packages in memory or on disk. Every 5 minutes, I delete the expired packages from storage and scan through the remaining packages to count the attributes. This method uses a lot of memory and has bad time complexity (O(n) for time and memory, where n is the number of unexpired packages), which makes the program's scalability terrible.
One alternative is to increment the attribute count every time a package comes in and start a Timer() thread that decrements the attribute count after the package expires. This eliminates the need to store all the bulky packages and cuts the time complexity to O(1). However, it creates another problem: my program will now have O(n) threads, which could cut into performance. Since most of the threads will be in the TIMED_WAITING state (Java's Timer() invokes the Object.wait(long) method) for the vast majority of their lifecycle, do they still impact the CPU in a very large way?
First, a Java (or .NET) thread != a kernel/OS thread.
A Java Thread is a high-level wrapper that abstracts some of the functionality of a system thread; these kinds of threads are also known as managed threads. At the kernel level a thread has only 2 states, running and not running. There's some management information (stack, instruction pointers, thread id, etc.) that the kernel keeps track of, but there is no such thing at the kernel level as a thread in a TIMED_WAITING state (the .NET equivalent is the WaitSleepJoin state). Those "states" exist only within those kinds of contexts (part of why the C++ std::thread does not have a state member).
Having said that, when a managed thread is blocked, it is blocked in one of a couple of ways (depending on how the block was requested at the managed level); the implementations I've seen in the OpenJDK threading code use semaphores to handle the managed waits (which is what I've seen in other C++ frameworks that have a sort of "managed" thread class, as well as in the .NET Core libraries) and use a mutex for other types of waits/locks.
Since most implementations use some sort of locking mechanism (like a semaphore or mutex), the kernel generally does the same thing (at least where your question is concerned); that is, the kernel takes the thread off the "run" queue and puts it on the "wait" queue (a context switch). Getting into thread scheduling, and specifically how the kernel handles the execution of threads, is beyond the scope of this Q&A, especially since your question concerns Java, and Java can run on quite a few different OSes (each of which handles threading completely differently).
Answering your questions more directly:
Do large numbers of threads in the JVM consume a lot of resources (memory, CPU) when the threads are in the TIMED_WAITING state (not sleeping) >99.9% of the time?
To this, there are a couple of things to note: each thread created consumes memory in the JVM (stack, ID, garbage collector, etc.), and the kernel consumes kernel memory to manage the thread at the kernel level. The memory consumed does not change unless you specifically change it, so whether the thread is sleeping or running, the memory is the same.
The CPU is what will change based on the thread activity and the number of threads requested (remember, a thread also consumes kernel resources, thus has to be managed at a kernel level, so the more threads that have to be handled, the more kernel time must be consumed to manage them).
Keep in mind that the kernel time needed to schedule and run the threads is extremely small (that's part of the point of the design), but it's still something to consider if you plan on running a lot of threads; additionally, if you know your application will run on a CPU (or cluster) with only a few cores, then the fewer cores you have available, the more the kernel has to context switch, adding additional time in general.
When the threads are waiting, how much CPU overhead does it cost to maintain them, if any is needed at all?
None. See above, but the CPU overhead used to manage the threads does not change based on the thread context. Extra CPU might be used for context switching and most certainly extra CPU will be utilized by the threads themselves when active, but there's no additional "cost" to the CPU to maintain a waiting thread vs. a running thread.
Does the answer also apply to non-JVM environments (like the Linux kernel)?
Yes and no. As stated, the managed contexts generally apply to most of these types of environments (e.g. Java, .NET, PHP, Lua, etc.), but those contexts can vary, and the threading idioms and general functionality are dependent upon the kernel being used. So while one specific kernel might be able to handle 1000+ threads per process, some might have hard limits and others might have other issues with higher thread counts per process; you'll have to consult the OS/CPU specs to see what kind of limits you might have.
Since most of the threads will be in the TIMED_WAITING state (Java's Timer() invokes the Object.wait(long) method) for the vast majority of their lifecycle, do they still impact the CPU in a very large way?
No (part of the point of a blocked thread), but something to consider: what if (edge case) all (or >50%) of those threads need to run at the exact same time? If you only have a few threads managing your packages, that might not be an issue, but say you have 500+; 250 threads all being woken at the same time would cause massive CPU contention.
Since you haven't posted any code, it's hard to make specific suggestions for your scenario, but one would be inclined to store the attributes in a class and keep instances of that class in a list or hash map that a Timer (or a separate thread) checks to see whether the current time matches a package's expiration time; the "expire" code would then run. This cuts the number of threads down to 1 and keeps the access time at O(1); but again, without code, that suggestion might not work in your scenario. A minimal sketch of the idea follows.
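For instance, here is a minimal sketch of that suggestion using a single ScheduledExecutorService thread instead of one Timer thread per package. The class, method, and attribute names are hypothetical stand-ins, since no code was posted:

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    public class AttributeCounter {
        private final Map<String, Long> counts = new ConcurrentHashMap<>();
        private final ScheduledExecutorService scheduler =
                Executors.newSingleThreadScheduledExecutor();

        public void onPackage(String attribute, long ttlMillis) {
            counts.merge(attribute, 1L, Long::sum);                // O(1) increment
            scheduler.schedule(
                    () -> counts.merge(attribute, -1L, Long::sum), // O(1) decrement at expiry
                    ttlMillis, TimeUnit.MILLISECONDS);
        }

        public long count(String attribute) {
            return counts.getOrDefault(attribute, 0L);
        }
    }

The scheduler keeps the pending expirations in its internal delay queue, so you still pay memory for the pending tasks, but only one thread ever sleeps on them.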
Hope that helps.
I am working with the Volley library on Android for HTTP communication. By default, Volley keeps 4 threads that take 'Request' objects (a Request object contains all the details for making an HTTP request: URL, HTTP method, data to be posted, etc.) from a BlockingQueue and make HTTP requests concurrently. When I analyze my app's requirements, only below 10% of the time will I be using all 4 threads at once; the rest of the time I will be using 1 or 2 threads from that pool. So in effect, 2 to 3 threads will be in wait() mode almost 90% of the time.
So here is my question:
1) What is the overhead of a thread which is in wait() mode? Does it consume a significant number of CPU cycles, and is it a good idea for me to keep all those threads waiting?
I assume that, since a waiting thread will be continuously checking a monitor/lock in a loop or so (in the internal implementation) in order to wake up, it might take a considerable number of CPU cycles to maintain a waiting thread. Correct me if I am wrong.
Thanks.
What is the overhead of a thread which is in wait() mode?
None. A waiting thread doesn't consume any CPU cycles at all; it just waits to be awakened. So don't worry about it.
I assume that, since a waiting thread will be continuously checking a monitor/lock in a loop or so (in the internal implementation) in order to wake up, it might take a considerable number of CPU cycles to maintain a waiting thread. Correct me if I am wrong.
That's not true. A waiting thread doesn't do any polling on a monitor/lock/anything.
The only situation where a big number of threads can hurt performance is when there are many active threads (many more than the number of CPUs/cores) that are frequently switched back and forth, because CPU context switching also comes at some cost. Waiting threads consume only memory, not CPU.
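That said, if the idle dispatchers still bother you, Volley lets you pick the dispatcher count by constructing the RequestQueue yourself instead of using the newRequestQueue() helper. A sketch, assuming an Android Context named context is in scope:

    import android.content.Context;
    import com.android.volley.RequestQueue;
    import com.android.volley.toolbox.BasicNetwork;
    import com.android.volley.toolbox.DiskBasedCache;
    import com.android.volley.toolbox.HurlStack;

    public class QueueFactory {
        public static RequestQueue createQueue(Context context) {
            RequestQueue queue = new RequestQueue(
                    new DiskBasedCache(context.getCacheDir(), 1024 * 1024), // 1 MB cache
                    new BasicNetwork(new HurlStack()),
                    2); // 2 network dispatcher threads instead of the default 4
            queue.start();
            return queue;
        }
    }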
If you want to look at the internal implementation of threads, I have to disappoint you. Methods like wait()/notify() are native, which means their implementation depends on the JVM. In the case of the HotSpot JVM, you can take a look at its source code (written in C++, with a bit of assembler).
But do you really need to? Why not just trust the JVM documentation?
Many times I've heard that it is better to keep the number of threads in a thread pool below the number of cores in the system, and that having twice as many threads as cores, or more, is not only wasteful but could also cause performance degradation.
Are those claims true? If not, what are the fundamental principles that debunk them (specifically relating to Java)?
Many times I've heard that it is better to keep the number of threads in a thread pool below the number of cores in the system, and that having twice as many threads as cores, or more, is not only wasteful but could also cause performance degradation.
The claims are not true as a general statement. That is to say, sometimes they are true (or true-ish) and other times they are patently false.
A couple of things are indisputably true:
More threads means more memory usage. Each thread requires a thread stack. For recent HotSpot JVMs, the minimum thread stack size is 64KB, and the default can be as much as 1MB. That can be significant. In addition, any thread that is alive is likely to own or share objects in the heap, whether or not it is currently runnable. Therefore it is reasonable to expect that more threads means a larger memory working set.
A JVM cannot have more threads actually running than there are cores (or hyperthread cores or whatever) on the execution hardware. A car won't run without an engine, and a thread won't run without a core.
Beyond that, things get less clear-cut. The "problem" is that a live thread can be in a variety of "states". For instance:
A live thread can be running; i.e. actively executing instructions.
A live thread can be runnable; i.e. waiting for a core so that it can be run.
A live thread can be synchronizing; i.e. waiting for a signal from another thread, or waiting for a lock to be released.
A live thread can be waiting on an external event; e.g. waiting for some external server / service to respond to a request.
The "one thread per core" heuristic assumes that threads are either running or runnable (according to the above). But for a lot of multi-threaded applications, the heuristic is wrong ... because it doesn't take account of threads in the other states.
Now "too many" threads clearly can cause significant performance degradation, simple by using too much memory. (Imagine that you have 4Gb of physical memory and you create 8,000 threads with 1Mb stacks. That is a recipe for virtual memory thrashing.)
But what about other things? Can having too many threads cause excessive context switching?
I don't think so. If you have lots of threads, your application's use of those threads can result in excessive context switches, and that is bad for performance. However, I posit that the root cause of the context switches is not the actual number of threads. The root of the performance problems is more likely that the application is:
synchronizing in a particularly wasteful way; e.g. using Object.notifyAll() when Object.notify() would be better (see the sketch after this list), OR
synchronizing on a highly contended data structure, OR
doing too much synchronization relative to the amount of useful work that each thread is doing, OR
trying to do too much I/O in parallel.
(In the last case, the bottleneck is likely to be the I/O system rather than context switches ... unless the I/O is IPC with services / programs on the same machine.)
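To illustrate the notifyAll()/notify() point from the list above: in this toy permit class, every waiter waits for the same condition, so waking a single thread is enough. Using notifyAll() here would wake every waiter just so that all but one can recheck the condition and go back to sleep. (A sketch only, not a replacement for java.util.concurrent.Semaphore.)

    class OnePermit {
        private int permits = 0;

        public synchronized void acquire() throws InterruptedException {
            while (permits == 0) {
                wait(); // releases the monitor; consumes no CPU while waiting
            }
            permits--;
        }

        public synchronized void release() {
            permits++;
            // Only one waiter can make progress per release, so notify() suffices;
            // notifyAll() would just cause a wasteful stampede.
            notify();
        }
    }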
The other point is that, in the absence of the confounding factors above, having more threads is not going to increase context switches. If your application has N runnable threads competing for M processors, and the threads are purely computational and contention-free, then the OS's thread scheduler is going to attempt to time-slice between them. But the length of a time slice is likely to be measured in tenths of a second (or more), so the context switch overhead is negligible compared with the work that a CPU-bound thread actually performs during its slice. And if we assume that the length of a time slice is constant, then the context switch overhead will be constant too. Adding more runnable threads (increasing N) won't change the ratio of work to overhead significantly.
In summary, it is true that "too many threads" is harmful for performance. However, there is no reliable universal "rule of thumb" for how many is "too many". And (fortunately) you generally have considerable leeway before the performance problems of "too many" become significant.
Having fewer threads than cores generally means you can't take advantage of all available cores.
The usual question is how many more threads than cores you want. That, however, varies depending on the amount of time (overall) your threads spend doing things like I/O versus the amount of time they spend doing computation. If they're all doing pure computation, then you'd normally want about the same number of threads as cores. If they're doing a fair amount of I/O, you'd typically want quite a few more threads than cores.
Looking at it from the other direction for a moment, you want enough threads running to ensure that whenever one thread blocks for some reason (typically waiting on I/O) you have another thread (that's not blocked) available to run on that core. The exact number that takes depends on how much of its time each thread spends blocked.
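A common rule of thumb (found, e.g., in Java Concurrency in Practice) sizes the pool as cores / (1 - fraction of time blocked). The 50% blocked fraction below is an assumption you would have to measure for your own workload:

    public class PoolSizing {
        public static void main(String[] args) {
            int cores = Runtime.getRuntime().availableProcessors();
            double blockedFraction = 0.5; // assumed: threads are blocked on I/O half the time
            int poolSize = (int) Math.ceil(cores / (1 - blockedFraction));
            System.out.println(poolSize + " threads for " + cores + " cores");
        }
    }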
That's not true, unless the number of threads is vastly more than the number of cores. The reasoning is that additional threads will mean additional context switches. But it's not true because an operating system will only make unforced context switches if those context switches are beneficial, and additional threads don't force additional context switches.
If you create an absurd number of threads, that wastes resources. But none of this is anything compared to how bad creating too few threads is. If you create too few threads, an unexpected block (such as a page fault) can result in CPUs sitting idle, and that swamps any possible harm from a few extra context switches.
Not exactly true; this depends on the overall software architecture. There is a reason to keep more threads than available cores: some of the threads may be suspended by the OS while they wait for I/O to complete. This may be an explicit I/O invocation (such as a synchronous read from a file) or implicit, such as system paging.
Actually, I've read in one book that keeping the number of threads at twice the number of CPU cores is good practice.
For REST API calls, or I/O-bound operations in general, having more threads than cores can improve performance by allowing multiple API requests to be processed in parallel. However, the optimal number of threads depends on various factors, such as the request frequency, the complexity of the request processing, and the resources available on the server.
If the API request processing is CPU-bound and requires a lot of computation, having too many threads may cause resource contention and lead to reduced performance. In such cases, the number of threads should be limited to the number of cores available.
On the other hand, if the API request processing is I/O-bound and involves a lot of waiting for responses from external resources such as databases, having more threads may improve performance by allowing multiple requests to be processed in parallel.
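For example, here is a hypothetical sketch using the JDK 11+ HttpClient; the URLs and the 4x multiplier are placeholders, not tuned values. A pool larger than the core count keeps the cores busy while most threads sit blocked on I/O:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class ParallelCalls {
        public static void main(String[] args) {
            int cores = Runtime.getRuntime().availableProcessors();
            ExecutorService pool = Executors.newFixedThreadPool(cores * 4); // mostly waiting
            HttpClient client = HttpClient.newHttpClient();
            List<String> urls = List.of("https://example.com/a", "https://example.com/b");
            for (String url : urls) {
                pool.submit(() -> {
                    HttpRequest req = HttpRequest.newBuilder(URI.create(url)).build();
                    HttpResponse<String> resp =
                            client.send(req, HttpResponse.BodyHandlers.ofString());
                    System.out.println(url + " -> " + resp.statusCode());
                    return null;
                });
            }
            pool.shutdown();
        }
    }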
In any case, it is recommended to perform performance testing to determine the optimal number of threads for your specific use case and monitor the system performance using metrics such as response time, resource utilization, and error rate.