I have this statement, which came from Goetz's Java Concurrency In Practice:
Runtime overhead of threads due to context switching includes saving and restoring execution context, loss of locality, and CPU time spent scheduling threads instead of running them.
What is meant by "loss of locality"?
When a thread works, it often reads data from memory and from disk. The data is often stored in contiguous or close locations in memory/on the disk (for example, when iterating over an array, or when reading the fields of an object). The hardware bets on that by loading blocks of memory into fast caches so that access to contiguous/close memory locations is faster.
When you have a high number of threads and you switch between them, those caches often need to be flushed and reloaded, which makes the code of a thread take more time than if it was executed all at once, without having to switch to other threads and come back later.
A bit like we humans need some time to get back to a task after being interrupted, find where we were, what we were doing, etc.
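To make "loss of locality" concrete, here is a small, hedged demo (class and method names are illustrative, and timings vary by machine): both loops compute the same sum over the same 2D array, but the row-major loop walks each row's contiguous array while the column-major loop jumps far ahead on every access, defeating the cache.

```java
// Micro-demo of spatial locality: summing a 2D array row-by-row
// (cache-friendly) vs column-by-column (cache-hostile). Both compute
// the same sum; on most hardware the row-major traversal is noticeably
// faster because each cache line loaded is fully used before the next fetch.
public class LocalityDemo {
    static final int N = 2048;

    static long sumRowMajor(int[][] a) {
        long s = 0;
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                s += a[i][j];          // walks each row's contiguous memory
        return s;
    }

    static long sumColumnMajor(int[][] a) {
        long s = 0;
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                s += a[i][j];          // jumps to a different row every access
        return s;
    }

    public static void main(String[] args) {
        int[][] a = new int[N][N];
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                a[i][j] = 1;

        long t1 = System.nanoTime();
        long r1 = sumRowMajor(a);
        long t2 = System.nanoTime();
        long r2 = sumColumnMajor(a);
        long t3 = System.nanoTime();

        System.out.println("row-major:    " + (t2 - t1) / 1_000_000 + " ms, sum=" + r1);
        System.out.println("column-major: " + (t3 - t2) / 1_000_000 + " ms, sum=" + r2);
    }
}
```

Run it a few times to let the JIT warm up; the relative gap, not the absolute numbers, is the point.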
Just to elaborate on the point of "cache miss" made by JB Nizet.
As a thread runs on a core, it keeps recently used data in the core-local L1/L2 caches. Modern processors typically read data from the L1/L2 cache in about 5-7 ns.
When, after a pause (from being interrupted, put on a wait queue, etc.), a thread runs again, it will most likely run on a different core. This means that the L1/L2 cache of the new core has no data related to the work the thread was doing. It now needs to go to main memory (which takes about 100 ns) to load data before it can proceed with its work.
There are ways to mitigate this issue, such as pinning threads to specific cores using a thread-affinity library.
In numerous articles, YouTube videos, etc., I have seen Java's volatile keyword explained as a problem of cache memory, where declaring a variable volatile ensures that reads/writes are forced to main memory, and not cache memory.
It has always been my understanding that in modern CPUs, cache memory implements coherency protocols that guarantee that reads/writes are seen equally by all processors, cores, hardware threads, etc, at all levels of the cache architecture. Am I wrong?
JLS §8.3.1.4 simply states:
A field may be declared volatile, in which case the Java Memory Model ensures that all threads see a consistent value for the variable (§17.4).
It says nothing about caching. As long as cache coherency is in effect, volatile variables simply need to be written to a memory address, as opposed to being stored locally in a CPU register, for example. There may also be other optimizations that need to be avoided to guarantee the contract of volatile variables, and thread visibility.
I am simply amazed at the number of people who imply that CPUs do not implement cache coherency, to the point that I have had to come to Stack Overflow because I am doubting my sanity. These people go to great lengths, with diagrams, animated diagrams, etc., to imply that cache memory is not coherent.
JLS §8.3.1.4 really is all that needs to be said, but if people are going to explain things in more depth, wouldn't it make more sense to talk about CPU registers (and other optimizations) than to blame CPU cache memory?
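As an aside, the visibility contract itself is easy to see in a minimal sketch (this demonstrates the JMM guarantee, not any particular hardware mechanism): without volatile on the flag below, the JMM would permit the reader to spin forever on a stale value; with volatile, the write is guaranteed to become visible and the program terminates.

```java
// Minimal visibility sketch: a reader thread spins on a flag that a
// writer thread clears. 'volatile' guarantees the reader eventually
// sees the write; without it, hoisting the read out of the loop would
// be a legal optimization and the loop could spin forever.
public class StopFlag {
    static volatile boolean running = true;
    static long iterations = 0;

    public static void main(String[] args) throws InterruptedException {
        Thread reader = new Thread(() -> {
            while (running) {      // volatile read: always observes the latest write
                iterations++;
            }
        });
        reader.start();
        Thread.sleep(100);
        running = false;           // volatile write: guaranteed to become visible
        reader.join();             // terminates promptly because of volatile
        System.out.println("reader stopped after " + iterations + " iterations");
    }
}
```

Reading `iterations` after `join()` is safe because `join()` itself establishes a happens-before edge.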
CPUs are very, very fast. That memory is physically a few centimeters away. Let's say 15 centimeters.
The speed of light is about 300,000 kilometers per second, give or take. That's 30,000,000,000 centimeters every second. The speed of light in a medium is not as fast as in vacuum, but it's close, so let's ignore that part. That means that sending a single signal from the CPU to the memory and back, even if the CPU and memory could both process it instantly, already limits you to 1,000,000,000 round trips per second, or 1 GHz (you need to cover 30 centimeters to get from the core to the memory and back, so you can do that 1,000,000,000 times every second; if you can do it any faster, you're travelling backwards in time. Or some such. You get a Nobel Prize if you figure out how to manage that one).
Processors are about that fast! We measure core speeds in GHz these days, as in: in the time it takes the signal to travel, the CPU's clock has already ticked. In practice, of course, the memory controller is not instantaneous either, nor is the CPU's pipelining system.
Thus:
It has always been my understanding that in modern CPUs, cache memory implements coherency protocols that guarantee that reads/writes are seen equally by all processors, cores, hardware threads, etc, at all levels of the cache architecture. Am I wrong?
Yes, you are wrong. QED.
I don't know why you think that or where you read that. You misremember, or you misunderstood what was written, or whatever was written was very very wrong.
In actual fact, an update to 'main memory' takes on the order of a thousand cycles! A CPU is just sitting there, twiddling its thumbs, doing nothing, in a time window where it could roll through a thousand, on some cores multiple thousands, of instructions. Memory is that slow. Molasses-level slow.
The fix is not registers; you are missing about 20 years of CPU improvement. There aren't 2 layers (registers, then main memory), no. There are more like 5: registers, on-die cache in multiple hierarchical levels, and then, eventually, main memory. To make it all very, very fast, these things are very, very close to the core. So close, in fact, that each core has its own, and, drumroll here: modern CPUs cannot read main memory. At all. They are entirely incapable of it.
Instead, what happens is that the CPU sees you read or write to main memory and translates that, as it can't actually do any of that directly, by figuring out which 'page' of memory you are trying to read/write (each chunk of e.g. 64k worth of memory is a page; the actual page size depends on the hardware). The CPU then checks whether that page is loaded in its on-die cache. If yes, great, and it's all mapped to that. Which does mean that, if 2 cores both have that page loaded, they both have their own copy, and obviously anything one core does to its copy is entirely invisible to the other core.
If the CPU does -not- find this page in its own on-die cache, you get what's called a cache miss, and the CPU will then check which of its loaded pages is least used and purge that page. Purging is 'free' if the CPU hasn't modified the page, but if the page is 'dirty', it will first ping the memory controller and then blast the entire 64k sequence of bytes into it (because sending a burst is way, way faster than waiting for the signal to bounce back and forth, or trying to figure out which part of that 64k block is dirty), and the memory controller will take care of it. Then, the same CPU pings the controller to blast the correct page to it and overwrites the space that was just purged. Now the CPU 'retries' the instruction, and this time it works, as that page IS now in 'memory', in the sense that the part of the CPU that translates the memory location to cachepage+offset no longer throws a CacheMiss.
And whilst all of that is going on, THOUSANDS of cycles can pass, because it's all very very slow. Cache misses suck.
This explains many things:
It explains why volatile is slow and synchronized is slower. Dog slow. In general, if you want big speed, you want processes that run [A] independently (no need to share memory between cores, except perhaps at the very start and the very end, to load in the data needed to operate on and to send out the result of the complex operation), and [B] with all the memory needed to perform the calculation fitting in 64k or so, depending on CPU cache sizes and how many pages of L1 cache it has.
It explains why one thread can observe a field having value A and another thread can observe the same field having a different value for DAYS on end if you're unlucky. If the cores aren't doing all that much, and the threads checking the values of those fields do so often enough, that page is never purged, and the 2 cores go on their merry way with their local value for days. A CPU doesn't sync pages for funsies. It only does this if the page is the 'loser' and gets purged.
It explains why Spectre happened.
It explains why LinkedList is slower than ArrayList even in cases where basic computational-complexity analysis (big-O notation) says it should be faster. As long as the ArrayList's data is limited to a single page, you can more or less consider it all virtually instant: it takes about the same order of magnitude of time to fly through an entire page of on-die cache as it takes that same CPU to wait out a single cache miss. And LinkedList is horrible on this front: every .add on it creates a tracker object (the linked list has to store the 'next' and 'prev' pointers somewhere!), so for every item in the linked list you have to read 2 objects (the tracker and the actual object) instead of just one (the ArrayList's array is in contiguous memory, so worst case that page is read into on-die cache once and remains active for your entire loop), and it's very easy to end up with the tracker object and the actual object on different pages.
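The ArrayList-vs-LinkedList point can be sketched as a rough micro-benchmark (treat the timings as indicative only; they vary wildly with machine, heap layout, and JIT warm-up):

```java
// Rough locality sketch: iterating an ArrayList walks one contiguous
// array, while iterating a LinkedList chases a node pointer per element.
// Both loops compute the same sum; the ArrayList loop is usually much
// faster for large n.
import java.util.*;

public class ListLocality {
    static long sum(List<Integer> list) {
        long s = 0;
        for (int v : list) s += v;
        return s;
    }

    public static void main(String[] args) {
        int n = 1_000_000;
        List<Integer> array = new ArrayList<>(n);
        List<Integer> linked = new LinkedList<>();
        for (int i = 0; i < n; i++) { array.add(1); linked.add(1); }

        long t1 = System.nanoTime();
        long a = sum(array);
        long t2 = System.nanoTime();
        long l = sum(linked);
        long t3 = System.nanoTime();

        System.out.println("ArrayList:  " + (t2 - t1) / 1_000_000 + " ms, sum=" + a);
        System.out.println("LinkedList: " + (t3 - t2) / 1_000_000 + " ms, sum=" + l);
    }
}
```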
It explains the Java Memory Model rules: any line of code may or may not observe the effect of any other line of code on the value of any field, unless you have established a happens-before/happens-after relationship using any of the many rules set out in the JMM. That's to give the JVM the freedom to, you know, not run literally 1000x slower than necessary, because guaranteeing consistent reads/writes could only be done by flushing memory on every read, and that is 1000x slower than not doing so.
NB: I have massively oversimplified things. I do not have the skill to fully explain ~20 years of CPU improvements in a mere SO answer. However, it should explain a few things, and it is a marvellous thing to keep in mind as you try to analyse what happens when multiple Java threads write/read the same field and you haven't gone out of your way to make very, very sure you have an HB/HA relationship between the relevant lines. If you're scared now, good. You shouldn't be communicating between 2 threads often, or even via fields, unless you really, really know what you are doing. Toss it through a message bus, use designs where the data flow is bounded to the start and end of the entire thread's process (make a job, initialize the job with the right data, toss it in an ExecutorPool queue, arrange to get notified when it's done, read out the result, and never share anything whatsoever with the actual thread that runs it), or talk to each other via the database.
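The "make a job, toss it in an ExecutorPool queue" advice can be sketched like this (SumJob is a made-up example job): the executor machinery establishes the happens-before edges for you, so no field of the job needs volatile or locks.

```java
// Sketch of job-based data flow: the job carries its own input and
// produces its own output; nothing is shared with the worker thread
// beyond submit() and Future.get(), both of which establish
// happens-before edges under the JMM.
import java.util.concurrent.*;

public class JobPassing {
    static class SumJob implements Callable<Long> {
        private final long[] data;             // initialized before submit()
        SumJob(long[] data) { this.data = data; }
        public Long call() {
            long s = 0;
            for (long v : data) s += v;
            return s;
        }
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        Future<Long> f = pool.submit(new SumJob(new long[]{1, 2, 3, 4}));
        System.out.println("sum = " + f.get());  // get() waits; result is safely published
        pool.shutdown();
    }
}
```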
Quick Background: As I am going back and redesigning some critical parts of an application, I keep wondering about locking and its impact on performance. The app has a large Tree style data structure which caches data/DTO from the database. Updates to the large tree can come about in two main ways: 1. user triggered commands, 2. auto updates from jobs that ran in the background.
When either operation type occurs (user/auto), I am locking down (explicitly locking) the data structure. I was running into consistency issues, so locking down everything seemed to make the most sense to protect the integrity of the data in the cache.
Question: Since many auto updates can occur at once I was thinking of implementing some kind of queue (JMS maybe) to handle instructions to the data structure, where any user driven updates get pushed to the top and handled first. When it comes to handling a bulk/unknown size set of auto "tasks", I am trying to figure out if I should let them run and lock individually or try and bulk them together by time and interact with locking once. The real crux of the problem is that any one of the tasks to update could affect the entire tree.
In terms of overall performance (general, nothing specific), is it more efficient to have many transactions, each taking the lock and potentially doing large updates, or to combine them into one massive bulk update that locks only once, but for a lot longer? I know a lot of this probably hinges on the data, the type of updates, frequency, etc. I just didn't know if there was a general rule of thumb of "smaller, more frequent locks" versus "one large, potentially longer lock".
I think the answer depends on whether your program spends any significant time with the data structure unlocked. If it does not, I recommend locking once for all pending updates.
The reason is that other threads waiting for the lock may get woken up and then uselessly sent back to sleep when the update thread quickly locks the resource again. Or the update is interrupted by another thread, which is likely bad for cache utilization. There is also a cost to locking itself, which may not be small compared to your update: pipelines may have to be flushed, memory accesses may not be freely reordered, etc.
If the thread spends some time between updates without having to lock the data structure, I would consider relocking for every update, provided other threads can be expected to complete their transactions in between, thereby reducing contention.
Note that when there are different priorities for different updates like I presume for your user updates versus the background updates, it may be a bad idea to lock down the data structure for a long time for lower priority updates if this could in any way prevent higher priority tasks from running.
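A minimal sketch of the two strategies under discussion (TreeUpdater and a simple long counter stand in for your tree; all names are illustrative): applyBatched takes the lock once for a whole batch of updates, while applyIndividually re-locks per update, giving other threads a chance to interleave at the cost of extra lock traffic.

```java
// Sketch of "one long lock" vs "many short locks" on a shared structure.
import java.util.List;
import java.util.concurrent.locks.ReentrantLock;

public class TreeUpdater {
    private final ReentrantLock lock = new ReentrantLock();
    private long value = 0;  // stand-in for the shared tree

    void applyBatched(List<Long> updates) {
        lock.lock();                 // one acquisition for the whole batch
        try {
            for (long u : updates) value += u;
        } finally {
            lock.unlock();
        }
    }

    void applyIndividually(List<Long> updates) {
        for (long u : updates) {
            lock.lock();             // other threads can slip in between updates
            try {
                value += u;
            } finally {
                lock.unlock();
            }
        }
    }

    long get() {
        lock.lock();
        try { return value; } finally { lock.unlock(); }
    }

    public static void main(String[] args) {
        TreeUpdater t = new TreeUpdater();
        t.applyBatched(java.util.Arrays.asList(1L, 2L, 3L));
        t.applyIndividually(java.util.Arrays.asList(4L, 5L));
        System.out.println("value = " + t.get());
    }
}
```

Either way, the try/finally pattern guarantees the lock is released even if an update throws.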
If you end up implementing some kind of a queue, then you lose all concurrency. If you get 1000 requests at once, think of how inefficient that is.
Try taking a look at this code for concurrent trees.
https://github.com/npgall/concurrent-trees
In a single-threaded program, how are changes made by the thread on core 1 made visible to core 2, so that after a context switch the thread (now running on core 2) will have the updated value?
Consider the following example:
The value in main memory for variable x is 10.
The thread runs on core 1 and changes x to 5; the new value is still in core 1's cache and has not yet been flushed to main memory, as we are not using any memory barrier.
A context switch occurs and the thread moves from core 1 to core 2.
The thread reads the value of x.
What would be the value of x if thread resumes execution in core 2 after the context switch?
If "cache coherence" manages consistency in a case like the above, then why do we need explicit locking (or any read/write barrier) in a multi-threaded program?
Regarding your first question: context switches also preserve the register contents. Therefore, the thread sees the latest value, even if it is moved to another core (or CPU).
However, in a multi-threaded program, CPU registers are distinct for different threads (regardless of how many cores the threads are executed on), and registers are not part of cache coherency.
Therefore, I think, a multi-threaded program does need to make sure the values in the registers are up-to-date with the values in the main memory.
(Cache coherence only makes sure that the CPU cache is up-to-date with the memory).
Therefore, I suppose, you need a barrier to synchronize the register with the memory.
You can understand it like this: the program essentially operates only on main memory. However, compilers optimise access to main memory and use registers for intermediate operations.
Thus, the program accesses only memory and registers.
However, the CPU also introduces its own cache of the memory.
Reads and writes from/to memory are internally optimised by the CPU through the cache.
Cache coherency only ensures, within the CPU, that the cache is up-to-date (and therefore that a program accessing memory gets the correct value).
To sum up:
Cache coherence ensures that the cache and memory are up-to-date; it is out of the program's control, as it is internal to the CPU.
Context switches are handled by the operating system, which ensures correct register values when it moves threads to different cores.
Memory barriers ensure that registers and memory are up-to-date; this is what the program has to ensure.
My Problem:
Does a large number of threads in the JVM consume a lot of resources (memory, CPU) when the threads are in the TIMED_WAITING state (not sleeping) >99.9% of the time? When the threads are waiting, how much CPU overhead does it cost to maintain them, if any is needed at all?
Does the answer also apply to non-JVM related environments (like linux kernels)?
Context:
My program receives a large number of space-consuming packages. It stores counts of similar attributes across the different packages. A given period of time after a package is received (it could be hours or days), that specific package expires, and any count it contributed to should be decremented.
Currently, I achieve this by storing all the packages in memory or on disk. Every 5 minutes, I delete the expired packages from storage and scan through the remaining packages to count the attributes. This method uses up a lot of memory and has bad time complexity (O(n) time and memory, where n is the number of unexpired packages), which makes the program scale terribly.
An alternative way to approach this problem is to increment the attribute count every time a package comes in, and start a Timer thread that decrements the attribute count after the package expires. This eliminates the need to store all the bulky packages and cuts the time complexity to O(1). However, it creates another problem, as my program will now have O(n) threads, which could cut into performance. Since most of the threads will be in the TIMED_WAITING state (Java's Timer invokes Object.wait(long)) for the vast majority of their lifecycle, do they still impact the CPU in a significant way?
First, a Java (or .NET) thread != a kernel/OS thread.
A Java Thread is a high-level wrapper that abstracts some of the functionality of a system thread; such threads are also known as managed threads. At the kernel level, a thread has only 2 states: running and not running. There is some management information (stack, instruction pointers, thread ID, etc.) that the kernel keeps track of, but there is no such thing at the kernel level as a thread in a TIMED_WAITING state (the .NET equivalent is the WaitSleepJoin state). Those "states" only exist within those kinds of contexts (which is part of why C++'s std::thread does not have a state member).
Having said that, when a managed thread is blocked, it is blocked in one of a couple of ways (depending on how the block was requested at the managed level); the implementations I've seen in the OpenJDK threading code use semaphores to handle the managed waits (which is also what I've seen in other C++ frameworks that have a sort of "managed" thread class, as well as in the .NET Core libraries), and a mutex for other types of waits/locks.
Since most implementations will utilize some sort of locking mechanism (like a semaphore or mutex), the kernel generally does the same thing (at least where your question is concerned); that is, the kernel will take the thread off of the "run" queue and put it in the "wait" queue (a context switch). Getting into thread scheduling and specifically how the kernel handles the execution of the threads is beyond the scope of this Q&A, especially since your question is in regards to Java and Java can be run on quite a few different types of OS (each of which handles threading completely differently).
Answering your questions more directly:
Does a large number of threads in the JVM consume a lot of resources (memory, CPU) when the threads are in the TIMED_WAITING state (not sleeping) >99.9% of the time?
There are a couple of things to note here: the created thread consumes memory in the JVM (stack, ID, garbage-collector bookkeeping, etc.), and the kernel consumes kernel memory to manage the thread at the kernel level. That consumed memory does not change unless you specifically change it, so whether the thread is sleeping or running, the memory footprint is the same.
The CPU is what will change based on the thread activity and the number of threads requested (remember, a thread also consumes kernel resources, thus has to be managed at a kernel level, so the more threads that have to be handled, the more kernel time must be consumed to manage them).
Keep in mind that the kernel times to schedule and run the threads are extremely minuscule (that's part of the point of the design), but it's still something to consider if you plan on running a lot of threads; additionally, if you know your application will be running on a CPU (or cluster) with only a few cores, the fewer cores you have available to you, the more the kernel has to context switch, adding additional time in general.
When the threads are waiting, how much CPU overhead does it cost to maintain them if any are needed at all?
None. See above: the CPU overhead used to manage the threads does not change based on the thread's state. Extra CPU might be used for context switching, and most certainly extra CPU will be used by the threads themselves when active, but there is no additional "cost" to the CPU for maintaining a waiting thread versus a running thread.
Does the answer also apply to non-JVM related environments (like linux kernels)?
Yes and no. As stated, the managed contexts generally apply to most of those types of environments (e.g. Java, .NET, PHP, Lua, etc.), but those contexts can vary, and the threading idioms and general functionality are dependent upon the kernel being used. So while one specific kernel might be able to handle 1000+ threads per process, some might have hard limits and others might have other issues with higher thread counts per process; you'll have to check the OS/CPU specs to see what kind of limits you might have.
Since most of the threads will be in the TIMED_WAITING state (Java's Timer invokes Object.wait(long)) for the vast majority of their lifecycle, do they still impact the CPU in a significant way?
No (part of the point of a blocked thread), but something to consider: what if (edge case) all (or >50%) of those threads need to run at the exact same time? If you only have a few threads managing your packages, that might not be an issue, but say you have 500+; 250 threads all being woken at the same time would cause massive CPU contention.
Since you haven't posted any code, it's hard to make specific suggestions for your scenario, but one approach would be to store the attributes in a class and keep instances of that class in a list or hash map that a single Timer (or a separate thread) checks, expiring any package whose expiration time has passed. This cuts the number of threads down to 1 and the access time to O(1); but again, without code, that suggestion might not work in your scenario.
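One way to realize the single-thread suggestion above is a DelayQueue-based design (ExpiryCounter, Expiry, and all field names here are made up for illustration): each incoming package increments its attribute counts and enqueues an expiry record; one background thread blocks on the queue and decrements counts as records expire.

```java
// Sketch: counts in a ConcurrentHashMap, expirations driven by a single
// thread blocking on a DelayQueue instead of one Timer thread per package.
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicLong;

public class ExpiryCounter {
    static class Expiry implements Delayed {
        final String attribute;
        final long expiresAtNanos;
        Expiry(String attribute, long ttlMillis) {
            this.attribute = attribute;
            this.expiresAtNanos = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(ttlMillis);
        }
        public long getDelay(TimeUnit unit) {
            return unit.convert(expiresAtNanos - System.nanoTime(), TimeUnit.NANOSECONDS);
        }
        public int compareTo(Delayed o) {
            return Long.compare(getDelay(TimeUnit.NANOSECONDS), o.getDelay(TimeUnit.NANOSECONDS));
        }
    }

    final ConcurrentHashMap<String, AtomicLong> counts = new ConcurrentHashMap<>();
    final DelayQueue<Expiry> queue = new DelayQueue<>();

    void record(String attribute, long ttlMillis) {
        counts.computeIfAbsent(attribute, k -> new AtomicLong()).incrementAndGet();
        queue.put(new Expiry(attribute, ttlMillis));
    }

    // Run this in a single background thread; take() blocks until an entry expires.
    void expireLoop() throws InterruptedException {
        while (true) {
            Expiry e = queue.take();
            counts.get(e.attribute).decrementAndGet();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        ExpiryCounter c = new ExpiryCounter();
        c.record("attr", 50);
        System.out.println("count now: " + c.counts.get("attr").get());
        Expiry e = c.queue.take();   // blocks ~50 ms until the entry expires
        c.counts.get(e.attribute).decrementAndGet();
        System.out.println("count after expiry: " + c.counts.get(e.attribute).get());
    }
}
```

This keeps memory O(number of unexpired records) without holding the bulky packages themselves.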
Hope that helps.
My Java program uses java.util.concurrent.Executor to run multiple threads. Each one starts a runnable class that reads a comma-delimited text file from the C: drive and loops through the lines to split and parse the text into floats. After that, the data is stored into:
static Vector
static ConcurrentSkipListMap
My PC runs Win 7 64-bit on an Intel Core i7 (six cores, two hardware threads each) with 24 GB of RAM. I have noticed that the program runs for 2 minutes and finishes all 1700 files, but the CPU usage is only around 10% to 15%, no matter how many threads I assign using:
Executor executor=Executors.newFixedThreadPool(50);
Executors.newFixedThreadPool(500) doesn't give better CPU usage or a shorter time to finish the tasks. There is no network traffic; everything is on the local C: drive. There is enough RAM for more threads to use, though it throws an OutOfMemoryError when I increase the thread count to 1000.
How come more threads don't translate into more CPU usage and less processing time?
Edit: My hard drive is a 200 GB SSD.
Edit: I finally found the problem: each thread writes its results to a log file shared by all threads. The more times I run the app, the larger the log file gets and the slower it becomes, and since it's shared, this definitely slows down the process. After I stopped writing to the log file, it finishes all tasks in 10 seconds!
The OutOfMemoryError is probably coming from Java's own limits on its memory usage. Try using some of the arguments here to increase the maximum memory.
For speed, Adam Bliss starts with a good suggestion. If this is the same file over and over, then having multiple threads try to read it at the same time could result in a lot of contention over locks on the file. More threads would mean even more contention, which could result in worse overall performance. So avoid that and simply load the file once if possible. Even if it's a large file, you have 24 GB of RAM; you can hold quite a large file, but you may need to increase the JVM's maximum memory to allow the whole file to be loaded.
If there are multiple files being used, then consider this fact: your disk can only read one file at a time. So having multiple threads trying to use the disk all at the same time probably won't be too effective if the threads aren't spending much time processing. Since you have so little CPU usage, it could be that the thread loads part of the file, then runs very quickly on the part that got buffered, and then spends a lot of time waiting for the rest of the file to load. If you're loading the file over and over, that could even still apply.
In short: Disk IO probably is your culprit. You need to work to reduce it so that the threads aren't contending for file content so much.
Edit:
After further consideration, it's more likely a synchronization issue. Threads are probably getting held up trying to add to the result list. If access is frequent, this will result in huge amounts of contention for locks on the object. Consider having each thread save its results in a local list (like an ArrayList, which is not thread-safe), and then copying all the values into the final, shared list in chunks, to reduce contention.
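A sketch of that "accumulate locally, merge in chunks" idea (the collect method and its parameters are illustrative): each worker fills its own ArrayList with no locking while parsing, and the shared list is touched exactly once per worker.

```java
// Sketch: thread-confined accumulation, then one bulk merge per worker,
// so the shared list sees 4 contended calls instead of 4000.
import java.util.*;
import java.util.concurrent.*;

public class LocalThenMerge {
    static List<Integer> collect(int workers, int perWorker) throws Exception {
        List<Integer> shared = Collections.synchronizedList(new ArrayList<>());
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        List<Future<?>> futures = new ArrayList<>();

        for (int w = 0; w < workers; w++) {
            final int worker = w;
            futures.add(pool.submit(() -> {
                List<Integer> local = new ArrayList<>();   // thread-confined: no locking
                for (int i = 0; i < perWorker; i++) local.add(worker * perWorker + i);
                shared.addAll(local);                      // one contended call per worker
            }));
        }
        for (Future<?> f : futures) f.get();               // wait for all workers
        pool.shutdown();
        return shared;
    }

    public static void main(String[] args) throws Exception {
        System.out.println("collected " + collect(4, 1000).size() + " results");
    }
}
```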
You're probably being limited by IO, not cpu.
Can you reduce the number of times you open the file to read it? Maybe open it once, read all the lines, keep them in memory, and then iterate on that.
Otherwise, you'll have to look at getting a faster hard drive. SSDs can be quite speedy.
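The "open it once, read all the lines, keep them in memory" suggestion might look like this (the temp file here stands in for your CSV on C:):

```java
// Sketch: read the file a single time up front, then hand the in-memory
// lines to the parsing threads instead of re-reading from disk.
import java.io.IOException;
import java.nio.file.*;
import java.util.List;

public class ReadOnce {
    static List<String> loadLines(Path file) throws IOException {
        return Files.readAllLines(file);   // one disk read; lines stay in RAM
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("demo", ".csv");
        Files.write(tmp, List.of("1.0,2.0", "3.0,4.0"));
        List<String> lines = loadLines(tmp);
        System.out.println(lines.size() + " lines cached in memory");
        Files.delete(tmp);
    }
}
```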
Is it possible that your threads are somehow given low priority on the system? In that case, increasing the number of threads wouldn't correspond to an increase in CPU usage, since the amount of CPU time allotted to your program may be throttled elsewhere.
Are there any configuration files or initialization steps where something like this could occur?