I'm learning reactive programming techniques, with async I/O etc, and I just can't find decent authoritative comparative data about the benefits of not switching threads.
Apparently switching threads is "expensive" compared to computations. But what scale are we talking on?
The essential question is "How many processor cycles/instructions does it take to switch a java thread?" (I'm expecting a range)
Is it affected by OS?
I presume it's affected by number of threads, which is why async IO is so much better than blocking - the more threads, the further away the context has to be stored (presumably even out of the cache into main memory).
I've seen "Approximate timings for various operations", which, although (way) out of date, is probably still useful for relating operations to processor cycles (a network operation would likely take more "instructions", an SSD disk probably fewer).
I understand that reactive applications enable web apps to go from thousands to tens of thousands of requests per second (per server), but that's hard to verify too - comments welcome.
NOTE - I know this is a bit of a vague, useless, fluffy question at the moment because I have little idea on the inputs that would affect the speed of a context switch. Perhaps statistical answers would help - as an example I'd guess >=60% of threads would take between 100-10000 processor cycles to switch.
Thread switching is done by the OS, so Java has little to do with it. Also, on Linux at least (and I presume on many other operating systems) the scheduling cost does not depend on the number of threads: Linux used an O(1) scheduler from version 2.6, and its replacement, the Completely Fair Scheduler, is O(log n) in the number of runnable tasks, which is effectively constant in practice.
The thread switch overhead on Linux is some 1.2 µs (article from 2018). Unfortunately the article doesn't list the clock speed at which that was measured, but at a typical 2-4 GHz that works out to roughly 2,400-4,800 clock cycles. On a given machine and OS the thread switching overhead should be more or less constant, not a wide range.
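If you want a ballpark figure for your own machine, here is a rough sketch (not a rigorous benchmark) that ping-pongs an item between two Java threads over SynchronousQueues, so each round trip is two blocking handoffs. Run it pinned to a single core (e.g. taskset -c 0 java SwitchPingPong) to force an actual context switch on every handoff; unpinned, part of the cost may be cross-core wakeups instead. The class name and round count are my own choices.

    import java.util.concurrent.SynchronousQueue;

    public class SwitchPingPong {
        public static void main(String[] args) throws InterruptedException {
            final int rounds = 100_000;
            SynchronousQueue<Integer> ping = new SynchronousQueue<>();
            SynchronousQueue<Integer> pong = new SynchronousQueue<>();

            // Echo thread: receive an item, hand it straight back.
            Thread echo = new Thread(() -> {
                try {
                    for (int i = 0; i < rounds; i++) {
                        pong.put(ping.take());
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            echo.start();

            long start = System.nanoTime();
            for (int i = 0; i < rounds; i++) {
                ping.put(i);    // blocks until the echo thread takes it
                pong.take();    // blocks until the echo thread hands it back
            }
            long elapsed = System.nanoTime() - start;
            echo.join();

            // Each round trip involves two handoffs, so divide by 2 * rounds.
            System.out.printf("~%d ns per handoff%n", elapsed / (2L * rounds));
        }
    }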
Apart from this direct switching cost there's also the cost of changing workload: the new thread is most likely using a different set of instructions and data, which need to be loaded into the cache, but this cost doesn't differ between a thread switch or an asynchronous programming 'context switch'. And for completeness, switching to an entirely different process has the additional overhead of changing the memory address space, which is also significant.
By comparison, the switching overhead between goroutines in the Go programming language (which uses userspace threads, quite similar to asynchronous programming techniques) was around 170 ns, roughly one seventh of a Linux thread switch.
Whether that is significant for you depends on your use case of course, but for most tasks the time you spend doing computation will be far more than the context switching overhead - unless you have many threads that each do an absolutely tiny amount of work before switching.
Threading overhead has improved a lot since the early 2000s, and according to the linked article, running 10,000 threads in production shouldn't be a problem on a recent server with a lot of memory. General claims of thread switching being slow are often based on yesteryear's computers, so take those with a grain of salt.
One remaining fundamental advantage of asynchronous programming is that the userspace scheduler has more knowledge about the tasks, and so can in principle make smarter scheduling decisions. It also doesn't have to deal with processes from different users doing wildly different things that still need to be scheduled fairly. But even that can be worked around, and with the right kernel extensions these Google engineers were able to reduce the thread switching overhead to the same range as goroutine switches (200 ns).
Rugal has a point. In modern architectures theoretical turn-around times are usually far off from actual measurements because both the hardware and the software have become so much more complex. It also inherently depends on your application. Many web applications, for example, are I/O-bound, where the context switch time matters a lot less.
Also note that context switching (what you refer to as thread switching) is an OS thing and not a Java thing. There is no guarantee as to how "heavy" a context switch in your OS is. It used to take tens if not hundreds of thousands of CPU cycles to do a kernel-level switch, but there are also user-level switches, as well as experimental systems, where even kernel-level switches can take only a few hundred cycles.
I run numerical simulations all the time. I can tell if my simulations don't work (i.e., they fail to give acceptable answers), but because I typically run a variable number of these on designated cores running in the background (as I work), looking at clock time tells me less than nothing about how quickly they ran.
I don't want clock time; I want CPU time. None of the articles seems to mention this little aspect. In particular, the recommendation to use a "quiet" machine seems to blur what's being measured.
I don't need a great deal of detail, I just want to know that simulation A runs about 15% faster or slower than simulation B or C, despite the fact that A ran by itself for a while, and then I started B, followed by C. And maybe I played for a little while before retiring, which would run a higher-priority application for part of that time. Don't tell me that ideally I should use a "quiet" machine; my question specifically asks how to do benchmarking without a dedicated machine for this. I also do not wish to kill the efficiency of my applications while measuring how long they take to run; it seems that significant overhead would only be required when a great deal of detail is needed. Am I right?
I want to modify my applications so that when I check whether a batch job succeeds, I can also see how long it took to reach these results in CPU time. Can benchmarking give me the answers I'm looking for? Can I simply use Java 9's benchmarking harness, or do I need something else?
You can measure CPU time instead of wall-clock time from outside the JVM easily enough on most OSes, e.g. time java -jar foo.jar on Unix/Linux, or even perf stat java -jar foo.jar on Linux.
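If you would rather report CPU time from inside the application (as the question asks), HotSpot/OpenJDK-based JVMs expose process CPU time through the com.sun.management extension of OperatingSystemMXBean. A minimal sketch, assuming such a JVM (the cast fails on JVMs that don't provide the extension); runBatchJob is a hypothetical stand-in for the real workload:

    import java.lang.management.ManagementFactory;

    public class CpuTimeReport {
        public static void main(String[] args) {
            // HotSpot/OpenJDK expose process CPU time via the com.sun.management extension.
            com.sun.management.OperatingSystemMXBean os =
                    (com.sun.management.OperatingSystemMXBean) ManagementFactory.getOperatingSystemMXBean();

            long cpuStart = os.getProcessCpuTime();   // nanoseconds of CPU time used so far, -1 if unsupported
            long wallStart = System.nanoTime();

            runBatchJob();                            // placeholder for the real batch job

            long cpuNanos = os.getProcessCpuTime() - cpuStart;
            long wallNanos = System.nanoTime() - wallStart;
            System.out.printf("CPU time: %.1f s, wall time: %.1f s%n",
                    cpuNanos / 1e9, wallNanos / 1e9);
        }

        private static void runBatchJob() {
            // hypothetical workload: burn some CPU
            double x = 0;
            for (int i = 0; i < 50_000_000; i++) x += Math.sqrt(i);
            if (x < 0) System.out.println(x);   // keep the result "used"
        }
    }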
The biggest problem with this is that some workloads have more parallelism than others. Consider this simple example. It's unrealistic, but the math works the same for real programs that alternate between more-parallel and less-parallel phases.
version A is purely serial for 9 minutes, and keeps 8 cores saturated for 1 minute. Wall-clock time = 10 minutes, CPU time = 17 minutes
version B is serial for 1 minute, and keeps all 8 cores busy for 5 minutes. Wall time = 6 minutes, CPU time = 5*8 + 1 = 41 minutes
If you were just looking at CPU time, you wouldn't know which version was stuck on an inherently serial portion of its work. (And this is assuming purely CPU-bound, no I/O waiting.)
For two similar implementations that are both mostly serial, though, CPU time and wall time could give you a reasonable guess.
But modern JVMs like HotSpot use multi-threaded garbage-collection, so even if your own code never starts multiple threads, one version that makes the GC do more work can use more CPU time but still be faster. That might be rare, though.
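On that GC point, the standard GarbageCollectorMXBeans will at least tell you how much collection work each run did. Note they report approximate elapsed collection time, not the CPU time of the GC threads, but it's a cheap way to spot a version that makes the collector work harder. A minimal sketch:

    import java.lang.management.GarbageCollectorMXBean;
    import java.lang.management.ManagementFactory;

    public class GcTimeReport {
        public static void main(String[] args) {
            long totalGcMillis = 0;
            for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
                // getCollectionTime() returns -1 if the collector doesn't report it.
                totalGcMillis += Math.max(0, gc.getCollectionTime());
                System.out.printf("%s: %d collections, %d ms%n",
                        gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
            }
            System.out.println("total GC time: " + totalGcMillis + " ms");
        }
    }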
Another confounding factor: contention for memory bandwidth and cache footprint will mean that it takes more CPU time to do the same work, because your code will spend more time waiting for memory.
And with HyperThreading or other SMT CPU architectures (like Ryzen), where one physical core can act as multiple logical cores, having both logical cores active increases total throughput at the cost of lower per-thread performance.
So 1 minute of CPU time on a core whose HT sibling is idle gets more work done than 1 minute when the other logical core was also active.
With both logical cores active, a modern Skylake or Ryzen might give you somewhere from 50 to 99% of the single-thread performance of having all the execution resources available for a single core, completely dependent on what the code is running on each thread. (If both bottleneck on latency of FP add and multiply with very long loop-carried dependency chains that out-of-order execution can't see past, e.g. both summing very large arrays in order with strict FP, that's the best case for HT. Neither thread will slow the other down, because FP add throughput is 3 to 8x FP add latency.)
But in the worst case, if both tasks slow down a lot from L1d cache misses, HT can even lose throughput from running both at once on the same core, vs. running one then the other.
Is there any easy, cheap (not requiring me to test the program on many hardware configurations) and painless method to define the hardware requirements (CPU, RAM, etc.) needed to run my own program? How should this be done?
I have a quite resource-hungry program written in Java and I don't know how to define a hardware specification that will be enough to run this application smoothly.
No, I don't think there is any generally applicable way to determine the minimum requirements that does not involve testing on some specified reference hardware.
You may be able to find some of the limitations by using Virtual Machines of some kind - it is easier to modify the parameters of some VM than modifying hardware. But there are artifacts generated by the interaction between host and VM that may influence your results.
It is also difficult to define the criteria for "acceptable performance" in general without knowing a lot about use cases.
Many programs will use more resources if they are available, but can also get along with less.
For example, consider a program using a thread pool with a size based on the number of CPU cores. When running on a CPU with more cores, more work can be done in parallel, but at the same time overhead due to thread creation, synchronisation and aggregation of results increases. The effects are non-linear in the number of CPUs and depend a lot on the actual program and data. Similarly, the effects of decreasing available memory range from potentially throwing OutOfMemoryErrors for some inputs (but possibly not for others) to just running GC a bit more frequently (and the effects of that depend on the GC strategy, ranging from noticeable freezes to just a bit more CPU load).
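To illustrate that self-adjusting behaviour, here is a minimal sketch (class and variable names are my own) of a program that sizes itself from whatever the JVM reports at startup instead of hard-coding requirements:

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class ResourceAwareStartup {
        public static void main(String[] args) {
            // Ask the JVM what it was actually given.
            int cores = Runtime.getRuntime().availableProcessors();
            long maxHeap = Runtime.getRuntime().maxMemory();   // respects -Xmx or the JVM default

            // Size the worker pool from the core count, as described above.
            ExecutorService pool = Executors.newFixedThreadPool(cores);
            System.out.printf("using %d worker threads, heap limit %d MB%n",
                    cores, maxHeap / (1024 * 1024));

            pool.shutdown();   // nothing submitted in this sketch
        }
    }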
All that is without even considering that programs don't usually live in isolation - they run on an operating system in parallel with other tasks that also consume resources.
I have a basic idea about concurrency, but I'm confused about the following architecture. I think it is concurrent but my colleague thinks it is not. The architecture is as follows:
I have multiple robots which publish their data to their individual gateways, and there's another Java service which listens on the gateways. The service creates a new thread to listen to each gateway.
My understanding is that the service is performing concurrent execution but my colleague says this is not concurrent as concurrency involves sharing of hardware.
I'd appreciate it if someone could clarify or elaborate on this topic.
TL/DR: Words are squishy. That's why we have code.
"Concurrent" simply means two or more things happening at the same time. As it applies to computation, true concurrency means two or more threads of execution running at the same time, which requires separate hardware. That certainly can be separate cores of the same CPU or separate CPUs in the same chassis, so that there is some degree of shared hardware. It can also be separate cores in different chassis, however, such as in a computational cluster, though perhaps this is where your colleague is drawing his line. Such a line would be pretty arbitrary, though.
In contrast, long before it was common for even servers to feature multiple CPU (core)s, many computer systems implemented one flavor or another of multitasking, whereby multiple tasks can all be in progress at the same time by virtue of the operating system allotting slices of CPU time to each and switching them in and out. All modern general-purpose operating systems still do this. On a single core, however, this provides only simulated concurrency, because at any given instant in time, only one computation is actually making progress.
Your colleague does have a point, however, that multiple, spatially distributed robots all operating at the same time without coordination is a bit beyond what people usually mean when they talk about concurrent computation. Certainly such robots are operating concurrently, in the general-use sense of "at the same time", but it's a bit of a stretch to characterize them as participating in a concurrent computation.
The server that allocates a separate thread to handle communication with each robot may thereby be performing a concurrent computation. But as long as we're splitting hairs, do recognize that communication over a single network interface is serialized, so unless your server has multiple network interfaces, the actual communication cannot be truly concurrent. If the server is primarily just recording the data as it arrives, as opposed to incorporating it into an ongoing concurrent computation, then it would be potentially misleading to describe it as performing a concurrent operation.
Even by your colleague's definition, this is a concurrent system since there are multiple threads executing on the hardware on which the service resides.
I am new to multithreading in Java. After looking at Java virtual machine - maximum number of threads it would appear there isn't a limit to how many threads a Java/Android app can run. However, is there an advisable limit? What I mean is: is there a number of threads beyond which it becomes unwise to go, because you can no longer determine which thread does what at what time? I hope my question makes sense.
There are some advisable limits, however they don't really have anything to do with keeping track of them.
Most multithreading comes with locking. If you are using central data storage or global mutable state then the more threads you have, the more lock contention you will get. This is app-specific and depends on how much of said state you have and how often threads read and write it.
There are no limits in desktop JVMs by default, but there are OS limits. It should be in the tens of thousands for modern Windows machines, but don't rely on the ability to create much more than that.
Running multiple tasks in parallel is great, but the hardware can only cope with so much. If you are using small threads that get fired up sometimes, and spend most of their time idle, that's no biggie (Java servers were written like this for years). However, if your threads are very intensive, making more of them than the number of cores you have is not likely to give you any benefit. (I believe the standard practice is twice the number of cores if you anticipate threads going idle sometimes.)
Threads have a cost to them. Whenever you switch Threads you switch context, and while it isn't that expensive, doing it constantly will hurt performance. It's not a good idea to create a Thread to sum up two integers and write back a result.
If Threads need visibility of each other's state, then they are greatly slowed down, since a lot of their writes have to be written back to main memory. Threads are best used for standalone tasks that require little interaction with each other.
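As a crude illustration of that visibility cost (not a rigorous benchmark - use JMH for real measurements), the sketch below compares threads hammering one shared counter with threads keeping private counters; the class name, thread count and iteration count are my own choices:

    import java.util.concurrent.atomic.AtomicLong;
    import java.util.function.IntConsumer;

    public class ContentionDemo {
        static final int THREADS = 4;
        static final long INCREMENTS = 20_000_000L;

        public static void main(String[] args) throws InterruptedException {
            // Variant 1: every thread increments the same counter, so each write must be
            // made visible to the other cores (constant cache-line ping-pong).
            AtomicLong shared = new AtomicLong();
            long sharedMs = run(i -> {
                for (long n = 0; n < INCREMENTS; n++) shared.incrementAndGet();
            });

            // Variant 2: each thread gets its own counter; no inter-thread visibility is
            // needed until the totals are combined at the end.
            AtomicLong[] own = new AtomicLong[THREADS];
            for (int i = 0; i < THREADS; i++) own[i] = new AtomicLong();
            long privateMs = run(i -> {
                for (long n = 0; n < INCREMENTS; n++) own[i].incrementAndGet();
            });

            System.out.printf("shared counter: %d ms, private counters: %d ms%n",
                    sharedMs, privateMs);
        }

        // Runs `body` on THREADS threads (each gets its index) and times the whole batch.
        private static long run(IntConsumer body) throws InterruptedException {
            Thread[] workers = new Thread[THREADS];
            for (int t = 0; t < THREADS; t++) {
                final int idx = t;
                workers[t] = new Thread(() -> body.accept(idx));
            }
            long start = System.nanoTime();
            for (Thread w : workers) w.start();
            for (Thread w : workers) w.join();
            return (System.nanoTime() - start) / 1_000_000;
        }
    }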
TL;DR
Depends on OS and Hardware: on servers creating thousands of threads is fine, on desktop machines you should limit yourself to 50-200 and choose carefully what you do with them.
Note: Android's default and suggested "UI multithread helper" - the AsyncTask - is not actually a thread. It's a task invoked from a ThreadPool, and as such there is no limit or penalty to using it. It has an upper limit on the number of threads it spawns and reuses them rather than creating new ones. Most Android apps should use it instead of spawning their own threads. In general, Thread Pools are fairly widespread and are a great choice unless you are forced into blocking operations.
I am implementing a worker pool in Java.
This is essentially a whole load of objects which will pick up chunks of data, process the data and then store the result. Because of IO latency there will be significantly more workers than processor cores.
The server is dedicated to this task and I want to wring the maximum performance out of the hardware (but no I don't want to implement it in C++).
The simplest implementation would be to have a single Java process which creates and monitors a number of worker threads. An alternative would be to run a Java process for each worker.
Assuming for argument's sake a quad-core Linux server, which of these solutions would you anticipate being more performant, and why?
You can assume the workers never need to communicate with one another.
One process, multiple threads - for a few reasons.
When context-switching between jobs, it's cheaper on some processors to switch between threads than between processes. This is especially important in this kind of I/O-bound case with more workers than cores. The more work you do between getting I/O blocked, the less important this is. Good buffering will pay for threads or processes, though.
When switching between threads in the same JVM, at least some Linux implementations (x86, in particular) don't need to flush cache. See Tsuna's blog. Cache pollution between threads will be minimized, since they can share the program cache, are performing the same task, and are sharing the same copy of the code. We're talking savings on the order of 100's of nanoseconds to several microseconds per switch. If that's small potatoes for you, then read on...
Depending on the design, the I/O data path may be shorter for one process.
The startup and warmup time for a thread is generally much shorter. The OS doesn't have to start a process, Java doesn't have to start another JVM, classloading is only done once, JIT-compilation is only done once, and HotSpot optimizations are done once, and sooner.
Well, usually when discussing multiprocessing (with one thread per process) versus multithreading in the same process, the theoretical overhead is bigger in the first case than in the latter (and thus multiprocessing is theoretically slower than multithreading), but in reality on most modern OSs this is not such a big issue. However, when discussing it in the Java context, starting a new process is a lot more costly than starting a new thread. Starting a new process means starting up a new instance of the JVM, which is very costly, especially in terms of memory. I recommend that you start multiple threads in the same JVM.
Moreover, if you say inter-thread communication is not an issue, you can use Java's ExecutorService to get a fixed thread pool of size 2x(number of available CPUs). The number of available CPUs can be autodetected at runtime via Java's Runtime class. This way you get quick, simple multithreading going without any boilerplate code.
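A minimal sketch of that setup - a single JVM with a fixed pool of 2x the core count, fed with Callables; processChunk is a hypothetical stand-in for the real load/process/store step:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    public class ChunkWorkerPool {
        public static void main(String[] args) throws Exception {
            // Oversubscribe the cores because the workers spend part of their time blocked on I/O.
            int poolSize = 2 * Runtime.getRuntime().availableProcessors();
            ExecutorService pool = Executors.newFixedThreadPool(poolSize);

            List<Callable<Long>> jobs = new ArrayList<>();
            for (int chunk = 0; chunk < 1_000; chunk++) {
                final int id = chunk;
                jobs.add(() -> processChunk(id));   // hypothetical: load a chunk, crunch it, store the result
            }

            List<Future<Long>> results = pool.invokeAll(jobs);   // blocks until every job has finished
            pool.shutdown();
            System.out.println("processed " + results.size() + " chunks");
        }

        // Placeholder for the real work: fetch the data, process it, persist the output.
        private static long processChunk(int id) {
            return id * 2L;
        }
    }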
Actually, if you do this with large-scale tasks, using multiple JVM processes is way faster than one JVM with multiple threads. At least we never got one JVM running as fast as multiple JVMs.
We do some calculations where each task uses around 2-3 GB of RAM and does some heavy number crunching. If we spawn 30 JVMs and run 30 tasks, they perform around 15-20% better than spawning 30 threads in one JVM. We tried tuning the GC and the various memory sections and never caught up to the first variant.
We did this on various machines: 14 tasks on a 16-core server, 34 tasks on a 36-core server, etc. Multithreading in Java always performed worse than multiple JVM processes.
It may not make any difference on simple tasks, but on heavy calculations the JVM seems to perform badly with threads.