I have the following Java code. Will there be any "context switching" occurring during its execution?
Collection<MyBusinessClass> myCollection = getMyCollection(); // has 1000 items
for (MyBusinessClass item : myCollection) {
    new Thread(() -> {
        MyLongRunningTask();
    }).start();
}
Thanks.
Unless you have enough cores to host all 1000 threads (plus the main thread, briefly, plus the few threads the JVM itself needs, such as GC and the finalizer), your threads will have to share cores, and hence context switches will occur. I am assuming here that MyLongRunningTask runs long enough that all the threads are still active by the time the last ones are spawned; otherwise the number of cores required is somewhat lower.
We can try to concoct a scenario in which the scheduler runs ALL tasks sequentially, never overlapping, by having very short tasks (or a fairly crazy scheduler), so you could get away with a small number of CPU cores. But that seems beside the point.
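If the goal is to avoid the switching rather than just predict it, the usual fix is a bounded pool instead of 1000 raw threads. A minimal sketch (not the asker's code; it assumes MyLongRunningTask can be wrapped in a Runnable):

import java.util.Collection;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Sketch: sizing the pool to the core count leaves the scheduler with only
// as many runnable threads as there are cores, so far fewer context
// switches than with 1000 raw Threads.
int cores = Runtime.getRuntime().availableProcessors();
ExecutorService pool = Executors.newFixedThreadPool(cores);
for (MyBusinessClass item : myCollection) {
    pool.submit(() -> MyLongRunningTask()); // same hypothetical task as in the question
}
pool.shutdown(); // accept no new work; the queued tasks still run to completion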
I have a large array of type C and a pool of threads. Each thread has a range of indexes (they don't overlap) and does some CPU-bound operations to populate them.
After submitting the tasks to the executor (created with newFixedThreadPool), I monitor the output of the 'top' command and notice that the CPU spends a significant amount of time in kernel space ("%sy" in the 'top' output) - between 15 and 25% - during the execution of those tasks (before, it is low, and afterwards it decreases again).
On some test runs "%sy" stays close to 0, and then the execution is much faster.
The number of threads is equal to the number of logical CPUs on the test machine, and this is also the number of tasks I submit to the executor (so it's 1 thread per 1 CPU-bound task). Therefore I wouldn't expect a lot of context switching here.
In this part of the code I do no explicit synchronization; I rely only on the guarantees provided by the executor service, as the threads don't share any variables.
The operating system is Amazon Linux AMI 2014.09, and the program runs on Java 8.
Any ideas why this could happen? How can I debug such an issue?
You might need to use a Profiler
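Before reaching for a full profiler, one cheap JVM-side check is to compare each thread's total CPU time against its user time: getThreadCpuTime() reports user plus system time and getThreadUserTime() reports user time only, so the difference approximates the kernel time that "%sy" is showing. A rough sketch, assuming per-thread CPU timing is supported on your JVM:

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

public class KernelTimeCheck {
    public static void main(String[] args) {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        if (!bean.isThreadCpuTimeSupported()) {
            System.out.println("Per-thread CPU time not supported on this JVM");
            return;
        }
        bean.setThreadCpuTimeEnabled(true);

        // ... run the tasks under investigation here ...

        for (long id : bean.getAllThreadIds()) {
            long cpu = bean.getThreadCpuTime(id);   // user + system, in ns; -1 if unavailable
            long user = bean.getThreadUserTime(id); // user only, in ns; -1 if unavailable
            if (cpu >= 0 && user >= 0) {
                System.out.printf("thread %d: ~%d ms spent in kernel space%n",
                        id, (cpu - user) / 1_000_000);
            }
        }
    }
}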
I'm experiencing some strange behaviour in a Java program. Basically, I have a list of items to process, which I can choose to process one at a time, or all at once (which means 3-4 at a time). Each item needs about 10 threads to be processed, so processing 1 item at a time = 10 threads, 2 at a time = 20 threads, 4 at a time = 40 threads, etc.
Here's the strange thing: if I process just one item, it's done in approx. 50-150 ms. But if I process 2 at a time, it goes up to 200-300 ms per item, 3 at a time = 300-500 ms per item, 4 at a time = 400-700 ms per item, etc.
Why is this happening? I've done prior research which says the JVM can handle up to 3000-4000 threads easily, so why does it slow down with just 30-40 threads for me? Is this normal behaviour? I thought that having 40 threads would mean each thread would work in parallel rather than in a queue, as seems to be happening.
How many CPU cores do you have?
If I have one CPU core and I max out a single-threaded application on it, the CPU is always busy. If I give it two threads, both doing this heavy task, I don't get double the CPU: each thread gets roughly 0.5 seconds of CPU time per second, minus the time the OS needs to switch threads.
So it roughly doubles the time taken for each thread to do its work, though they might finish at about the same time (depending on the scheduler).
If you have two CPU cores... then (theoretically again) the two threads would finish in the same time a single thread alone would take, because one thread can't use two CPU cores at the same time.
Then there are hardware threads; some threads yield or sleep; if they're reading or writing, the OS will run other threads while they are blocked; and so forth...
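To make that concrete, here is a small demo (my sketch, not part of the original answer) that runs the same CPU-bound busy loop on 1, 2, 4 and 8 threads and prints the wall-clock time; on a single core, the time should roughly double at each step:

import java.util.ArrayList;
import java.util.List;

public class CpuSharingDemo {
    // CPU-bound work: a fixed number of floating-point operations
    static void burn() {
        double x = 0;
        for (int i = 0; i < 200_000_000; i++) {
            x += Math.sqrt(i);
        }
        if (x == 42) System.out.println(x); // keep the JIT from removing the loop
    }

    public static void main(String[] args) throws InterruptedException {
        for (int n = 1; n <= 8; n *= 2) {
            long start = System.nanoTime();
            List<Thread> threads = new ArrayList<>();
            for (int i = 0; i < n; i++) {
                Thread t = new Thread(CpuSharingDemo::burn);
                t.start();
                threads.add(t);
            }
            for (Thread t : threads) t.join();
            System.out.printf("%d thread(s): %d ms%n",
                    n, (System.nanoTime() - start) / 1_000_000);
        }
    }
}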
Does this help?
It would be nice to see some source code.
Without it, I have only 4 assumptions:
1) You haven't done any load balancing. You should think about the optimal number of threads.
2) The work executed by each thread does not justify the time needed to set up and start the thread (plus context-switching time).
3) There are real problems with your code quality.
4) Weak hardware.
Just wondering what is the best way to decide when to stop creating new threads on a single-core machine that is running the same routine in multiple threads?
The threads are fetching web content and doing a bit of processing, which means the load of each thread is not constant all the way until the thread terminates.
I'm thinking of having a thread which monitors the CPU/RAM load and stops creating threads if the load reaches a certain threshold, but also stops creating threads once a certain thread count has been reached, to make sure the CPU doesn't get overloaded.
Any feedback on what techniques are out there to achieve this?
Many thanks,
Vladimir
It is going to be difficult to do this by monitoring the CPU used by the current process. Those numbers tend to lag reality, and the result is going to be peaks and valleys to a large degree. The problem is that your threads are mostly going to be blocked by IO, and there is no good way to anticipate when bytes will be available to be read in the near future.
That said, you could start out with a ThreadPoolExecutor at a certain max thread number (for a single processor let's say 4) and then check every 10 seconds or so the load average. If the load average is below what you want then you could call setMaximumPoolSize(...) with a larger value to increase it for the next 10 seconds. You may need to poll 30 or more seconds between each calculation to smooth out the performance of your application.
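A sketch of that polling loop, assuming the one-minute system load average is an acceptable signal (the class name, target value and bounds here are illustrative, not from the answer):

import java.lang.management.ManagementFactory;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class LoadAwarePool {
    public static void main(String[] args) {
        // start at 4 threads, as suggested above for a single processor
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                4, 4, 60, TimeUnit.SECONDS, new LinkedBlockingQueue<>());
        double targetLoad = 1.0; // illustrative threshold; tune for your machine

        ScheduledExecutorService monitor = Executors.newSingleThreadScheduledExecutor();
        monitor.scheduleAtFixedRate(() -> {
            double load = ManagementFactory.getOperatingSystemMXBean().getSystemLoadAverage();
            if (load < 0) return; // load average not available on this platform
            int size = pool.getMaximumPoolSize();
            if (load < targetLoad && size < 64) {
                pool.setMaximumPoolSize(size + 1); // grow max first so core <= max holds
                pool.setCorePoolSize(size + 1);
            } else if (load > targetLoad && size > 1) {
                pool.setCorePoolSize(size - 1);    // shrink core first for the same reason
                pool.setMaximumPoolSize(size - 1);
            }
        }, 10, 10, TimeUnit.SECONDS);

        // ... submit the spider tasks to `pool` here ...
    }
}

Note that with an unbounded work queue the pool never grows past its core size, which is why the sketch adjusts both the core and the maximum together.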
You could use the following code to track your total CPU time for all threads. Not sure if that's the best way to do it
ThreadMXBean threadMxBean = ManagementFactory.getThreadMXBean();

long total = 0;
for (long id : threadMxBean.getAllThreadIds()) {
    long cpuTime = threadMxBean.getThreadCpuTime(id); // -1 if not supported or thread gone
    if (cpuTime > 0) {
        total += cpuTime;
    }
}
// getThreadCpuTime() reports nanoseconds
long currentCpuMillis = total / 1000000;
Instead of trying to maximize the CPU level for your spider, you might consider trying to maximize throughput. Sample the number of pages spidered per unit of time and increase or decrease the max number of threads in your ExecutorService until this is maximized.
One thing to consider is to use NIO and selectors so your threads are always busy, as opposed to always waiting for IO. Here's a good example tutorial about NIO/selectors. You might also consider using Pyronet, which seems to provide some good features around NIO.
If async I/O is not a good fit, I would consider using thread pools, e.g. ThreadPoolExecutor, so you don't have the overhead of creating, destroying and recreating threads.
Then I would do performance testing to find the max number of threads that offers the best performance.
You could start with 10 threads, then rerun your performance test with 20 threads until you hone in on an optimal value. At the same time I would use system tools (depending on your OS) to monitor the thread run queue, JVM, etc.
For the performance test you would have to ensure that your test is repeatable (i.e. using the same inputs) and representative of the actual input that your program would be using.
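A sketch of such a repeatable harness (the workload stand-in and the counts are illustrative; substitute a deterministic version of your real task):

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class PoolSizeBenchmark {
    // stand-in for one unit of work; replace with a repeatable version of
    // your real task (same inputs on every run)
    static final Callable<Long> TASK = () -> {
        long x = 0;
        for (int i = 0; i < 5_000_000; i++) x += i;
        return x;
    };

    public static void main(String[] args) throws Exception {
        for (int threads = 10; threads <= 80; threads *= 2) {
            ExecutorService pool = Executors.newFixedThreadPool(threads);
            List<Callable<Long>> tasks = new ArrayList<>();
            for (int i = 0; i < 400; i++) tasks.add(TASK); // fixed workload
            long start = System.nanoTime();
            pool.invokeAll(tasks); // blocks until every task has completed
            System.out.printf("%d threads: %d ms%n",
                    threads, (System.nanoTime() - start) / 1_000_000);
            pool.shutdown();
        }
    }
}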
Let's assume each thread is doing some FP calculation. I am interested in:
how much time the CPU is used in switching threads instead of running them
how much synchronization traffic is created on the shared memory bus - when threads share data, they must use a synchronization mechanism
My question: how to design a test program to get this data?
You can't easily differentiate the waste due to thread switching from that due to memory-cache contention. You CAN measure the thread contention. Namely, on Linux, you can cat /proc/PID/XXX and get tons of detailed per-thread statistics. HOWEVER, since the pre-emptive scheduler is not going to shoot itself in the foot, you're not going to get more than, say, 30 context switches per second no matter how many threads you use, and that time is going to be relatively small vs. the amount of work you're doing. The real cost of context switching is cache pollution; e.g., there is a high probability that you'll have mostly cache misses once you're context-switched back in. Thus OS time and context-switch counts are of minimal value.
What's REALLY valuable is the ratio of inter-thread cache-line dirties. Depending on the CPU, a cache-line dirty followed by a peer-CPU read is SLOWER than a cache miss, because you have to force the peer CPU to write its value to main memory before you can even start reading. Some CPUs let you pull from peer cache lines without hitting main memory.
So the key is to absolutely minimize ANY shared, modified memory structures. Make everything as read-only as possible. This INCLUDES shared FIFO buffers (including Executor pools). Namely, if you use a synchronized queue, then every sync op is a shared dirty memory region. And moreover, if the rate is high enough, it will likely trigger an OS trap to stall, waiting for peer threads' mutexes.
The ideal is to segment RAM, distribute a single large unit of work to a fixed number of workers, then use a count-down latch or some other memory barrier (such that each thread only touches it once). Ideally any temporary buffers are pre-allocated instead of going into and out of a shared memory pool (which causes cache contention). Java 'synchronized' blocks leverage (behind the scenes) a shared hash-table memory space and thus trigger the undesirable dirty reads; I haven't determined whether Java 5 Lock objects avoid this, but you're still leveraging OS stalls, which won't help your throughput. Obviously most OutputStream operations trigger such synchronized calls (and of course typically fill a common stream buffer).
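A minimal sketch of that shape (my illustration, not the answer's code): each worker writes only its own disjoint slice of one shared array, touches no shared mutable state while computing, and signals completion exactly once through a CountDownLatch:

import java.util.concurrent.CountDownLatch;

public class SegmentedWork {
    public static void main(String[] args) throws InterruptedException {
        final double[] data = new double[8_000_000];
        final int workers = Runtime.getRuntime().availableProcessors();
        final CountDownLatch done = new CountDownLatch(workers);

        int chunk = data.length / workers;
        for (int w = 0; w < workers; w++) {
            final int from = w * chunk;
            final int to = (w == workers - 1) ? data.length : from + chunk;
            new Thread(() -> {
                // each thread dirties only cache lines in its own slice
                // (modulo false sharing right at the slice boundaries)
                for (int i = from; i < to; i++) {
                    data[i] = Math.sqrt(i);
                }
                done.countDown(); // the only shared write, done exactly once
            }).start();
        }
        done.await(); // memory barrier: all slices are visible after this
        System.out.println("done, sample value: " + data[data.length - 1]);
    }
}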
Generally my experience is that single-threading is faster than multithreading for a common byte-array/object-array, etc., at least with the simplistic sorting/filtering algorithms I've experimented with. This is true both in Java and C, in my experience. I haven't tried FPU-intensive ops (like divides, sqrt), where cache lines may be less of a factor.
Basically, if you're on a single CPU you don't have cache-line problems (unless the OS is always flushing the cache even between threads), but multithreading buys you less than nothing. With hyper-threading, it's the same deal. In single-CPU shared L2/L3 cache configurations (e.g. AMD's), you might find some benefit. On multi-CPU Intel buses, forget it: shared write memory is worse than single-threading.
To measure how much time a context switch takes I would run something like the following:
public class ContextSwitchTimer {

    // written by TheTask, read by main; the synchronized blocks on theLock
    // provide the necessary visibility guarantees
    static long startTime;

    public static void main(String[] args) {
        Object theLock = new Object();
        long endTime = 0;
        synchronized (theLock) {
            Thread task = new TheTask(theLock);
            task.start();
            try {
                theLock.wait();
                endTime = System.nanoTime(); // millis are far too coarse for a single switch
            } catch (InterruptedException e) {
                // do something if interrupted
            }
        }
        System.out.println("Context switch time elapsed: "
                + (endTime - startTime) + " ns");
    }
}

class TheTask extends Thread {

    private final Object theLock;

    public TheTask(Object theLock) {
        this.theLock = theLock;
    }

    public void run() {
        synchronized (theLock) {
            ContextSwitchTimer.startTime = System.nanoTime();
            theLock.notify();
        }
    }
}
You might want to run this code several times to get an average, and make sure these two threads are the only ones running on your machine (so the context switch happens only between these two threads).
how much time the CPU is used in switching threads instead of running them
Let's say you have 100 million FP operations to perform.
Load them into a synchronized queue (i.e., threads must lock the queue when polling).
Let n be the number of processors available on your device (duo = 2, etc...).
Then create n threads draining the queue to perform all the FP operations. You can compute the total time with System.currentTimeMillis() before and after. Then try with n+1 threads, then n+2, n+3, etc...
In theory, the more threads you have, the more switching there will be and the more time it should take to process all the operations. This will give you a very rough idea of the switching overhead, but it is hard to measure.
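A sketch of that experiment under the stated assumptions (the work-item split and the thread-count range are illustrative, and a LinkedBlockingQueue stands in for the 'synchronized queue'):

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class SwitchingOverheadBench {
    public static void main(String[] args) throws InterruptedException {
        final int items = 10_000;       // 100M FP ops split into 10k-op work items
        final int opsPerItem = 10_000;
        int n = Runtime.getRuntime().availableProcessors();

        for (int threads = n; threads <= 4 * n; threads++) {
            BlockingQueue<Integer> queue = new LinkedBlockingQueue<>();
            for (int i = 0; i < items; i++) queue.add(opsPerItem);

            long start = System.currentTimeMillis();
            Thread[] workers = new Thread[threads];
            for (int t = 0; t < threads; t++) {
                workers[t] = new Thread(() -> {
                    double x = 0;
                    Integer ops;
                    while ((ops = queue.poll()) != null) { // lock-protected poll
                        for (int i = 0; i < ops; i++) x += Math.sqrt(i);
                    }
                    if (x == 42) System.out.println(x); // defeat dead-code elimination
                });
                workers[t].start();
            }
            for (Thread w : workers) w.join();
            System.out.printf("%d threads: %d ms%n",
                    threads, System.currentTimeMillis() - start);
        }
    }
}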
how much synchronization traffic is created on the shared memory bus - when threads share data, they must use a synchronization mechanism
I would create 10 threads, each sending 10,000 messages to another randomly chosen thread via a synchronized blocking queue of 100 messages. Each thread would peek at the blocking queue to check whether the head message is for it, and pull it out if so. Then it would try to push a message in without blocking, then repeat the peek operation, etc., until the queue is empty and all threads return.
On its way, each thread would count the number of successful pushes and peeks/pulls versus unsuccessful ones. Then you would have a rough idea of useful work versus useless work in the synchronization traffic. Again, this is hard to measure.
Of course, you could play with the number of threads or the size of the blocking queue too.
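A rough sketch of that second experiment (sizes and names are illustrative; a 'message' here is simply the id of the destination thread, and finished senders keep draining so the head of the queue can never clog on a departed receiver):

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;

public class SyncTrafficBench {
    public static void main(String[] args) throws InterruptedException {
        final int threads = 10;
        final int messagesPerThread = 10_000;
        final BlockingQueue<Integer> queue = new ArrayBlockingQueue<>(100);
        final AtomicInteger activeSenders = new AtomicInteger(threads);
        final AtomicLong useful = new AtomicLong();  // successful pushes and pulls
        final AtomicLong useless = new AtomicLong(); // failed pushes, wasted peeks

        Thread[] workers = new Thread[threads];
        for (int id = 0; id < threads; id++) {
            final int myId = id;
            workers[id] = new Thread(() -> {
                int sent = 0;
                while (sent < messagesPerThread || activeSenders.get() > 0) {
                    Integer head = queue.peek(); // is the head message for me?
                    if (head != null && head == myId && queue.remove(head)) {
                        useful.incrementAndGet();
                    } else {
                        useless.incrementAndGet();
                    }
                    if (sent < messagesPerThread) {
                        int dest = ThreadLocalRandom.current().nextInt(threads);
                        if (queue.offer(dest)) { // non-blocking push
                            useful.incrementAndGet();
                            if (++sent == messagesPerThread) {
                                activeSenders.decrementAndGet();
                            }
                        } else {
                            useless.incrementAndGet();
                        }
                    }
                }
            });
            workers[id].start();
        }
        for (Thread w : workers) w.join();
        System.out.println("useful ops:  " + useful.get());
        System.out.println("useless ops: " + useless.get());
    }
}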
A Java process starts 5 threads; each thread takes 5 minutes. What will be the minimum and maximum time taken by the process? It would be of great help if someone could explain this in terms of Java threads and OS threads.
Edit: I want to know how Java schedules threads at the OS level.
This depends on the number of logical processor cores you have, the processes already running, and the priority of the threads. The theoretical minimum would be 5 minutes plus the small overhead of starting and controlling threads, if you have at least five logical processor cores. The theoretical maximum would be 25 minutes plus the small overhead, if you have only one logical processor core available. The mentioned overhead is usually no more than a few milliseconds.
The theoretical maximum can, however, be unpredictably (much) higher if a lot of other threads with a higher priority from processes other than the JVM are running at the same time.
Edit: I want to know how Java schedules threads at the OS level.
The JVM just spawns another native thread, and it gets assigned to the process associated with the JVM itself.
Minimum time: 5 minutes, assuming the threads run entirely concurrently with no interdependencies and each has a dedicated core available. Maximum time: 25 minutes, assuming each thread needs exclusive use of some global resource and so can't run in parallel with any other thread.
A glib (but realistic) answer for the maximum is that they might take an infinite amount of time to complete, as multi-threaded programs often contain deadlock bugs.
It depends! There isn't enough information to quantify this.
Missing info: Hardware - how many threads can run at the same time on your CPU. Workload - does it take 5 minutes because it is busy for 5 minutes, or is it performing some calculation that usually takes about 5 minutes and uses a lot of CPU resources?
When you run multiple threads concurrently, there can be lock waits for resources, or the threads may even have to take turns executing; although they have been running for 5 minutes, they may only have had a few CPU seconds.
Five threads never equals 5x the output. It can get close, but it will never reach 5x.
I am not sure whether you are looking for the CPU time spent by the thread. If that is the case, you can measure it as follows:

ThreadMXBean tb = ManagementFactory.getThreadMXBean();

// call this from the thread itself, when it starts its work
long startTime = tb.getCurrentThreadCpuTime();

// ... the thread's work ...

// call this, again from the same thread, when the work is done
long endTime = tb.getCurrentThreadCpuTime();

The difference endTime - startTime is the CPU time, in nanoseconds, that the thread used.