Waiting Threads Resource Consumption - java

My Problem:
Do large numbers of threads in the JVM consume a lot of resources (memory, CPU) when the threads are in the TIMED_WAITING state (not sleeping) more than 99.9% of the time? While the threads are waiting, how much CPU overhead, if any, does it cost to maintain them?
Does the answer also apply to non-JVM environments (like the Linux kernel)?
Context:
My program receives a large number of space-consuming packages. It stores counts of similar attributes across the different packages. A given period of time after receiving a package (which could be hours or days), that specific package expires, and any count the package contributed to should be decremented.
Currently, I implement this by storing all the packages in memory or on disk. Every 5 minutes, I delete the expired packages from storage and scan through the remaining packages to count the attributes. This method uses up a lot of memory and has bad complexity (O(n) in both time and memory, where n is the number of unexpired packages), which makes the program scale terribly.
One alternative approach is to increment the attribute count every time a package arrives and start a Timer() thread that decrements the count after the package expires. This eliminates the need to store all the bulky packages and cuts the time complexity to O(1). However, it creates another problem: my program will have O(n) threads, which could cut into performance. Since most of the threads will be in the TIMED_WAITING state (Java's Timer() invokes the Object.wait(long) method) for the vast majority of their lifecycle, do they still impact the CPU in a significant way?
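To make that design concrete, here is a minimal sketch of the Timer-per-package approach just described (the class and method names are illustrative, not from the original program); note that each call creates its own timer thread, which is exactly the O(n)-threads concern:

    import java.util.Timer;
    import java.util.TimerTask;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentMap;

    // Illustrative sketch of the one-Timer-per-package idea.
    class NaiveExpiringCounts {
        private final ConcurrentMap<String, Integer> counts = new ConcurrentHashMap<>();

        void onPackageReceived(String attribute, long ttlMillis) {
            counts.merge(attribute, 1, Integer::sum);           // increment now
            Timer timer = new Timer(true);                      // one daemon thread per package
            timer.schedule(new TimerTask() {
                @Override public void run() {
                    counts.merge(attribute, -1, Integer::sum);  // decrement on expiry
                    timer.cancel();                             // release the timer thread
                }
            }, ttlMillis);
        }
    }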

First, a Java (or .NET) thread != a kernel/OS thread.
A Java Thread is a high-level wrapper that abstracts some of the functionality of a system thread; these kinds of threads are also known as managed threads. At the kernel level a thread has only 2 states, running and not running. There's some management information (stack, instruction pointer, thread ID, etc.) that the kernel keeps track of, but there is no such thing at the kernel level as a thread in a TIMED_WAITING state (the .NET equivalent is the WaitSleepJoin state). Those "states" only exist within those kinds of contexts (part of why the C++ std::thread does not have a state member).
Having said that, when a managed thread is blocked, it is blocked in one of a couple of ways, depending on how the block was requested at the managed level. The implementations I've seen in the OpenJDK threading code use semaphores to handle the managed waits (which is what I've seen in other C++ frameworks that have a sort of "managed" thread class, as well as in the .NET Core libraries) and a mutex for other types of waits/locks.
Since most implementations will utilize some sort of locking mechanism (like a semaphore or mutex), the kernel generally does the same thing (at least where your question is concerned); that is, the kernel will take the thread off of the "run" queue and put it in the "wait" queue (a context switch). Getting into thread scheduling and specifically how the kernel handles the execution of the threads is beyond the scope of this Q&A, especially since your question is in regards to Java and Java can be run on quite a few different types of OS (each of which handles threading completely differently).
Answering your questions more directly:
Do large numbers of threads in the JVM consume a lot of resources (memory, CPU) when the threads are in the TIMED_WAITING state (not sleeping) more than 99.9% of the time?
To this, there are a couple of things to note: the created thread consumes memory in the JVM (stack, ID, garbage-collector bookkeeping, etc.), and the kernel consumes kernel memory to manage the thread at the kernel level. That memory does not change unless you explicitly change it, so whether the thread is sleeping or running, the memory footprint stays the same.
The CPU usage is what changes based on the thread activity and the number of threads requested (remember, a thread also consumes kernel resources and thus has to be managed at the kernel level, so the more threads there are to handle, the more kernel time must be spent managing them).
Keep in mind that the kernel time needed to schedule and run threads is minuscule (that's part of the point of the design), but it's still something to consider if you plan on running a lot of threads; additionally, if you know your application will run on a CPU (or cluster) with only a few cores, the fewer cores you have available, the more the kernel has to context switch, adding overhead in general.
When the threads are waiting, how much CPU overhead, if any, does it cost to maintain them?
None. See above, but the CPU overhead used to manage the threads does not change based on the thread context. Extra CPU might be used for context switching and most certainly extra CPU will be utilized by the threads themselves when active, but there's no additional "cost" to the CPU to maintain a waiting thread vs. a running thread.
Does the answer also apply to non-JVM environments (like the Linux kernel)?
Yes and no. As stated, the managed contexts generally apply to most of those types of environments (e.g. Java, .NET, PHP, Lua, etc.), but those contexts can vary, and the threading idioms and general functionality are dependent upon the kernel being utilized. So while one specific kernel might be able to handle 1000+ threads per process, some might have hard limits and others might run into other issues at higher thread counts per process; you'll have to reference the OS/CPU specs to see what kind of limits you might have.
Since most of the threads will be in the TIMED_WAITING state (Java's Timer() invokes the Object.wait(long) method) for the vast majority of their lifecycle, do they still impact the CPU in a significant way?
No (part of the point of a blocked thread), but something to consider: what if (edge case) all (or >50%) of those threads need to run at the exact same time? If you only have a few threads managing your packages, that might not be an issue, but say you have 500+; 250 threads all being woken at the same time would cause massive CPU contention.
Since you haven't posted any code, it's hard to make suggestions specific to your scenario, but one approach would be to store the attributes in a class and keep those objects in a list or hash map that a single Timer (or a separate thread) checks to see whether the current time matches a package's expiration time, running the "expire" code when it does. This cuts the number of threads down to 1 and the access time to O(1); but again, without code, that suggestion might not work for your scenario.
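As a hedged sketch of that suggestion (the class and method names are mine, not from the question): a single ScheduledExecutorService thread can run every decrement as a queued task, so the thread count stays at 1 no matter how many packages are in flight:

    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentMap;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    // Sketch: one scheduler thread handles every expiration, so the
    // thread count stays constant regardless of the number of packages.
    class SingleThreadExpiringCounts {
        private final ConcurrentMap<String, Long> counts = new ConcurrentHashMap<>();
        private final ScheduledExecutorService scheduler =
                Executors.newSingleThreadScheduledExecutor();

        void onPackageReceived(String attribute, long ttl, TimeUnit unit) {
            counts.merge(attribute, 1L, Long::sum);
            // The decrement is a queued task, not a dedicated thread.
            scheduler.schedule(
                    () -> { counts.merge(attribute, -1L, Long::sum); },
                    ttl, unit);
        }

        long countFor(String attribute) {
            return counts.getOrDefault(attribute, 0L);
        }
    }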
Hope that helps.

Does the number of processes that can be executed depend on the number of cores? [duplicate]

I've gotten confused about something.
What I know is that the maximum number of threads that can run concurrently on a normal CPU of a modern computer ranges from 8 to 16 threads.
On the other hand, using GPUs thousands of threads can run concurrently without the scheduler interrupting any thread to schedule another one.
In several posts, such as Java virtual machine - maximum number of threads (https://community.oracle.com/message/10312772), people state that they run thousands of Java threads concurrently on normal CPUs.
How could this be?
And how can I know the maximum number of threads that can run concurrently, so that my code adjusts itself dynamically according to the underlying architecture?
Threads aren't tied to or limited by the number of available processors/cores. The operating system scheduler can switch back and forth between any number of threads on a single CPU. This is the meaning of "preemptive multitasking."
Of course, if you have more threads than cores, not all threads will be executing simultaneously. Some will be on hold, waiting for a time slot.
In practice, the number of threads you can have is limited by the scheduler - but that number is usually very high (thousands or more). It will vary from OS to OS and with individual versions.
As far as how many threads are useful from a performance standpoint, as you said it depends on the number of available processors and on whether the task is IO or CPU bound. Experiment to find the optimal number and make it configurable if possible.
There is hardware and software concurrency. The 8 to 16 threads refers to the hardware you have - that is, one or more CPUs with hardware to execute 8 to 16 threads in parallel with each other. The thousands of threads refers to the number of software threads; the scheduler has to swap them out so that every software thread gets its time slice to run on the hardware.
To get the number of hardware threads you can try Runtime.getRuntime().availableProcessors().
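A small, illustrative program showing the distinction (the printed numbers will vary by machine, and the thread count here is arbitrary): the hardware thread count comes from availableProcessors(), while the number of software threads is bounded only by OS limits and memory:

    public class HardwareVsSoftwareThreads {
        public static void main(String[] args) {
            // Hardware concurrency: how many threads can truly run in parallel.
            System.out.println("Hardware threads: "
                    + Runtime.getRuntime().availableProcessors());

            // Software concurrency: far more threads can exist, mostly waiting.
            for (int i = 0; i < 1_000; i++) {
                Thread t = new Thread(() -> {
                    try {
                        Thread.sleep(60_000); // parked, consuming essentially no CPU
                    } catch (InterruptedException ignored) { }
                });
                t.setDaemon(true);
                t.start();
            }
            // Approximate count of live threads in this thread group.
            System.out.println("Live threads: " + Thread.activeCount());
        }
    }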
At any given instant, a processor runs at most as many threads as it has cores. This means that on a uniprocessor system, only one thread (or no thread) is being run at any given moment.
However, processors do not run each thread to completion one after another; rather, they switch between multiple threads rapidly to simulate concurrent execution. If this weren't the case, you wouldn't even be able to start multiple applications, let alone create multiple threads.
A Java thread (compared to processor instructions) is a very high-level abstraction of a set of instructions for the CPU to process. When it gets down to the processor level, there is no guarantee which threads will run on which core at any given time. But given that processors rapidly switch between these threads, it is theoretically possible to create an almost unlimited number of threads, albeit at the cost of performance.
If you think about it, a modern computer has thousands of threads running at the same time (combining all applications) while only having 1 to 16 cores (the typical case). Without this task switching, nothing would ever get done.
If you are optimizing your application, you should choose the number of threads based on the work at hand, not the underlying architecture. Performance gains from parallelism should be weighed against the increasing overheads of thread execution. Since every machine and every runtime environment is different, it is impractical to work out some golden thread count (however, a ballpark estimate may be made by benchmarking and looking at the number of cores).
While the other answers have explained how you can theoretically have thousands of threads in your application at the cost of memory and the other overheads already well explained here, it is worth noting that the default concurrencyLevel for the concurrent data structures provided in the java.util.concurrent package (for example, ConcurrentHashMap) is 16.
You will come across contention issues if you don't account for this.
Using a significantly higher value than you need can waste space and time, and a significantly lower value can lead to thread contention.
Make sure you have set an appropriate concurrencyLevel in case you are running into concurrency issues with a higher number of threads.
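For illustration (the capacity and level below are made-up numbers, not recommendations), the three-argument ConcurrentHashMap constructor is where concurrencyLevel is set:

    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentMap;

    class TunedMap {
        // Illustrative sizing: expect roughly 64 threads writing concurrently.
        // On Java 8+ concurrencyLevel is only a sizing hint, but on older
        // JVMs it fixes the number of internal lock segments.
        static final ConcurrentMap<String, Long> COUNTS =
                new ConcurrentHashMap<>(16_384, 0.75f, 64);
    }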

Thread locality

I have this statement, which came from Goetz's Java Concurrency In Practice:
Runtime overhead of threads due to context switching includes saving and restoring execution context, loss of locality, and CPU time spent scheduling threads instead of running them.
What is meant by "loss of locality"?
When a thread works, it often reads data from memory and from disk. The data is often stored in contiguous or close locations in memory/on the disk (for example, when iterating over an array, or when reading the fields of an object). The hardware bets on that by loading blocks of memory into fast caches so that access to contiguous/close memory locations is faster.
When you have a high number of threads and you switch between them, those caches often need to be flushed and reloaded, which makes the code of a thread take more time than if it was executed all at once, without having to switch to other threads and come back later.
A bit like how we humans need some time to get back to a task after being interrupted: finding where we were, remembering what we were doing, and so on.
Just to elaborate on the "cache miss" point made by JB Nizet.
As a thread runs on a core, it keeps recently used data in the L1/L2 cache which are local to the core. Modern processors typically read data from L1/L2 cache in about 5-7 ns.
When, after a pause (from being interrupted, put on a wait queue, etc.), a thread runs again, it will most likely run on a different core. This means that the L1/L2 cache of the new core has no data related to the work the thread was doing. It now needs to go to main memory (which takes about 100 ns) to load data before proceeding with its work.
There are ways to mitigate this issue by pinning threads to a specific core by using a thread affinity library.
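The price of lost locality is easy to observe even without thread switches. A rough sketch, not a rigorous benchmark (a real measurement would use JMH, and the timings will vary by machine): the two loops below do identical work, but the strided one defeats the caches and the prefetcher:

    // Rough illustration of cache locality; not a rigorous benchmark.
    public class LocalityDemo {
        public static void main(String[] args) {
            int n = 1 << 24;              // 16M ints = 64 MB
            int[] data = new int[n];

            long t0 = System.nanoTime();
            long sum = 0;
            for (int i = 0; i < n; i++) sum += data[i];      // sequential: cache-friendly
            long t1 = System.nanoTime();

            for (int stride = 0; stride < 4096; stride++)    // strided: same total work,
                for (int i = stride; i < n; i += 4096)       // hostile access pattern
                    sum += data[i];
            long t2 = System.nanoTime();

            System.out.printf("sequential: %d ms, strided: %d ms (sum=%d)%n",
                    (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000, sum);
        }
    }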

What is a Thread? How many threads does my program have? [duplicate]

I have been trying to find a good definition, and get an understanding, of what a thread really is.
It seems that I must be missing something obvious, but every time I read about what a thread is, it's almost a circular definition, a la "a thread is a thread of execution" or " a way to divide into running tasks". Uh uh. Huh?
It seems from what I have read that a thread is not really something concrete, like a process is. It is in fact just a concept. From what I understand of the way this works, a processor executes some commands for a program (which has been termed a thread of execution), then when it needs to switch to processing some other program for a bit, it stores the state of the program it's currently executing somewhere (Thread Local Storage) and then starts executing the other program's instructions. And back and forth. So a thread is really just a concept for "one of the paths of execution" of a program that is currently running.
Unlike a process, which really is something - it is a conglomeration of resources, etc.
As an example of a definition that didn't really help me much . . .
From Wikipedia:
"A thread in computer science is short for a thread of execution. Threads are a way for a program to divide (termed "split") itself into two or more simultaneously (or pseudo-simultaneously) running tasks. Threads and processes differ from one operating system to another but, in general, a thread is contained inside a process and different threads in the same process share same resources while different processes in the same multitasking operating system do not."
So am I right? Wrong? What is a thread really?
Edit: Apparently a thread is also given its own call stack, so that is somewhat of a concrete thing.
A thread is an execution context, which is all the information a CPU needs to execute a stream of instructions.
Suppose you're reading a book, and you want to take a break right now, but you want to be able to come back and resume reading from the exact point where you stopped. One way to achieve that is by jotting down the page number, line number, and word number. So your execution context for reading a book is these 3 numbers.
If you have a roommate, and she's using the same technique, she can take the book while you're not using it, and resume reading from where she stopped. Then you can take it back, and resume it from where you were.
Threads work in the same way. A CPU is giving you the illusion that it's doing multiple computations at the same time. It does that by spending a bit of time on each computation. It can do that because it has an execution context for each computation. Just like you can share a book with your friend, many tasks can share a CPU.
On a more technical level, an execution context (therefore a thread) consists of the values of the CPU's registers.
Last: threads are different from processes. A thread is a context of execution, while a process is a bunch of resources associated with a computation. A process can have one or many threads.
Clarification: the resources associated with a process include memory pages (all the threads in a process have the same view of the memory), file descriptors (e.g., open sockets), and security credentials (e.g., the ID of the user who started the process).
A thread is an independent set of values for the processor registers (for a single core). Since this includes the Instruction Pointer (aka Program Counter), it controls what executes in what order. It also includes the Stack Pointer, which had better point to a unique area of memory for each thread or else they will interfere with each other.
Threads are the software unit affected by control flow (function call, loop, goto), because those instructions operate on the Instruction Pointer, and that belongs to a particular thread. Threads are often scheduled according to some prioritization scheme (although it's possible to design a system with one thread per processor core, in which case every thread is always running and no scheduling is needed).
In fact the value of the Instruction Pointer and the instruction stored at that location is sufficient to determine a new value for the Instruction Pointer. For most instructions, this simply advances the IP by the size of the instruction, but control flow instructions change the IP in other, predictable ways. The sequence of values the IP takes on forms a path of execution weaving through the program code, giving rise to the name "thread".
In order to define a thread formally, we must first understand the boundaries of where a thread operates.
A computer program becomes a process when it is loaded from some store into the computer's memory and begins execution. A process can be executed by a processor or a set of processors. A process description in memory contains vital information such as the program counter which keeps track of the current position in the program (i.e. which instruction is currently being executed), registers, variable stores, file handles, signals, and so forth.
A thread is a sequence of such instructions within a program that can be executed independently of other code.
Threads are within the same process address space, thus, much of the information present in the memory description of the process can be shared across threads.
Some information cannot be shared, however: each thread has its own stack (a stack pointer into a different memory area per thread), registers, and thread-specific data. This information suffices to allow threads to be scheduled independently of the program's main thread and possibly of one or more other threads within the program.
Explicit operating system support is required to run multithreaded programs. Fortunately, most modern operating systems support threads: Linux (via NPTL), the BSD variants, Mac OS X, Windows, Solaris, AIX, HP-UX, etc. Operating systems may use different mechanisms to implement multithreading support.
Let me just add a sentence from Introduction to Embedded Systems by Edward Lee and Seshia:
Threads are imperative programs that run concurrently and share a memory space. They can access each others’ variables. Many practitioners in the field use the term “threads” more narrowly to refer to particular ways of constructing programs that share memory, [others] to broadly refer to any mechanism where imperative programs run concurrently and share memory. In this broad sense, threads exist in the form of interrupts on almost all microprocessors, even without any operating system at all (bare iron).
Processes are like two people using two different computers, who use the network to share data when necessary. Threads are like two people using the same computer, who don't have to share data explicitly but must carefully take turns.
Conceptually, threads are just multiple worker bees buzzing around in the same address space. Each thread has its own stack, its own program counter, etc., but all threads in a process share the same memory. Imagine two programs running at the same time, but they both can access the same objects.
Contrast this with processes. Processes each have their own address space, meaning a pointer in one process cannot be used to refer to an object in another (unless you use shared memory).
I guess the key things to understand are:
Both processes and threads can "run at the same time".
Processes do not share memory (by default), but threads share all of their memory with other threads in the same process.
Each thread in a process has its own stack and its own instruction pointer.
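A tiny illustrative program showing both points: the counter lives on the heap and is visible to both threads, while each loop index lives on its own thread's private stack:

    import java.util.concurrent.atomic.AtomicInteger;

    public class SharedMemoryDemo {
        // One object on the heap, visible to every thread in the process.
        static final AtomicInteger SHARED = new AtomicInteger();

        public static void main(String[] args) throws InterruptedException {
            Runnable work = () -> {
                // 'i' lives on this thread's private stack...
                for (int i = 0; i < 100_000; i++) {
                    SHARED.incrementAndGet(); // ...but SHARED is the same object for both
                }
            };
            Thread a = new Thread(work), b = new Thread(work);
            a.start(); b.start();
            a.join(); b.join();
            System.out.println(SHARED.get()); // 200000: both threads updated one object
        }
    }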
I am going to use a lot of text from the book Operating System Concepts by Abraham Silberschatz, Peter Baer Galvin, and Greg Gagne, along with my own understanding of things.
Process
Any application resides in the computer in the form of text (or code).
We emphasize that a program by itself is not a process. A program is a passive entity, such as a file containing a list of instructions stored on disk (often called an executable file).
When we start an application, we create an instance of execution. This instance of execution is called a process.
EDIT: (As per my interpretation, this is analogous to a class and an instance of a class: the instance of the class is the process.)
An example is Google Chrome. When we start Google Chrome, three types of process are spawned:
• The browser process is responsible for managing the user interface as well as disk and network I/O. A new browser process is created when Chrome is started. Only one browser process is created.
• Renderer processes contain logic for rendering web pages. Thus, they contain the logic for handling HTML, Javascript, images, and so forth. As a general rule, a new renderer process is created for each website opened in a new tab, and so several renderer processes may be active at the same time.
• A plug-in process is created for each type of plug-in (such as Flash or QuickTime) in use. Plug-in processes contain the code for the plug-in as well as additional code that enables the plug-in to communicate with associated renderer processes and the browser process.
Thread
To answer this I think you should first know what a processor is. A processor is the piece of hardware that actually performs the computations.
EDIT: (Computations like adding two numbers, sorting an array, basically executing the code that has been written)
Now moving on to the definition of a thread.
A thread is a basic unit of CPU utilization; it comprises a thread ID, a program counter, a register set, and a stack.
EDIT: Definition of a thread from intel's website:
A Thread, or thread of execution, is a software term for the basic ordered sequence of instructions that can be passed through or processed by a single CPU core.
So, if the Renderer process from the Chrome application sorts an array of numbers, the sorting will take place on a thread/thread of execution. (The grammar regarding threads seems confusing to me)
My Interpretation of Things
A process is an execution instance. Threads are the actual workers that perform the computations via CPU access. When there are multiple threads running for a process, the process provides common memory.
EDIT:
Other Information that I found useful to give more context
All modern computers have more than one hardware thread. The number of hardware threads in a computer depends on its number of cores.
Concurrent Computing:
From Wikipedia:
Concurrent computing is a form of computing in which several computations are executed during overlapping time periods—concurrently—instead of sequentially (one completing before the next starts). This is a property of a system—this may be an individual program, a computer, or a network—and there is a separate execution point or "thread of control" for each computation ("process").
So, I could write a program which calculates the sum of 4 numbers:
(1 + 3) + (4 + 5)
In the program to compute this sum (which will be one process running on a thread of execution) I can fork another process which can run on a different thread to compute (4 + 5) and return the result to the original process, while the original process calculates the sum of (1 + 3).
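In Java the natural unit of forked work is a thread rather than a process, but the same sketch translates directly (illustrative code of mine, not the answerer's):

    public class ParallelSum {
        public static void main(String[] args) throws InterruptedException {
            // Main thread computes (1 + 3); a second thread computes (4 + 5).
            java.util.concurrent.atomic.AtomicInteger partial =
                    new java.util.concurrent.atomic.AtomicInteger();
            Thread worker = new Thread(() -> partial.set(4 + 5));
            worker.start();

            int left = 1 + 3;   // computed concurrently with the worker
            worker.join();      // wait for the other partial result
            System.out.println(left + partial.get()); // 13
        }
    }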
This was taken from a Yahoo Answer:
A thread is a coding construct unaffected by the architecture of an application. A single process frequently may contain multiple threads. Threads can also directly communicate with each other since they share the same variables.
Processes are independent execution units with their own state information. They also use their own address spaces and can only interact with other processes through interprocess communication mechanisms.
However, to put it in simpler terms, threads are like different "tasks". So think of when you are doing something: for instance, you are writing down a formula on one piece of paper. That can be considered one thread. Then another thread is you writing something else on another piece of paper. That is where multitasking comes in.
Intel processors are said to have "hyper-threading" (AMD has it too) and it is meant to be able to perform multiple "threads" or multitask much better.
I am not sure about the logistics of how a thread is handled. I do recall hearing about the processor going back and forth between them, but I am not 100% sure about this and hopefully somebody else can answer that.
A thread is nothing more than a memory context (or, as Tanenbaum better puts it, a resource grouping) with execution rules. It's a software construct. The CPU has no idea what a thread is (with some exceptions: some processors have hardware threads); it just executes instructions.
The kernel introduces the thread and process concepts to manage memory and instruction ordering in a meaningful way.
A thread is a set of (CPU) instructions which can be executed.
But in order to have a better understanding of what a thread is, some computer architecture knowledge is required.
What a computer does, is to follow instructions and manipulate data.
RAM is the place where the instructions and data are saved, the processor uses those instructions to perform operations on the saved data.
The CPU has some internal memory cells called registers. It can perform simple mathematical operations with numbers stored in these registers. It can also move data between the RAM and these registers. These are examples of typical operations a CPU can be instructed to execute:
Copy data from memory position #220 into register #3
Add the number in register #3 to the number in register #1.
The collection of all operations a CPU can do is called instruction set. Each operation in the instruction set is assigned a number. Computer code is essentially a sequence of numbers representing CPU operations. These operations are stored as numbers in the RAM. We store input/output data, partial calculations, and computer code, all mixed together in the RAM.
The CPU works in a never-ending loop, always fetching and executing an instruction from memory. At the core of this cycle is the PC register, or Program Counter. It's a special register that stores the memory address of the next instruction to be executed.
The CPU will:
Fetch the instruction at the memory address given by the PC,
Increment the PC by 1,
Execute the instruction,
Go back to step 1.
The CPU can be instructed to write a new value to the PC, causing the execution to branch, or "jump" to somewhere else in the memory. And this branching can be conditional. For instance, a CPU instruction could say: "set PC to address #200 if register #1 equals zero". This allows computers to execute stuff like this:
    if (x == 0)
        compute_this();
    else
        compute_that();
Resources used from Computer Science Distilled.
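The fetch/increment/execute cycle described above can be mimicked in a few lines of Java. A toy sketch, with a three-opcode instruction set invented purely for illustration:

    // Toy model of the fetch/increment/execute cycle; the instruction
    // set (LOAD, ADD, JUMP-IF-ZERO) is invented for this example.
    public class ToyCpu {
        public static void main(String[] args) {
            // "RAM": each instruction is {opcode, operand}.
            // 0 = LOAD operand, 1 = ADD operand, 2 = JUMP-IF-ZERO to operand.
            int[][] memory = {
                    {0, 10},  // LOAD 10
                    {1, -5},  // ADD -5  -> 5
                    {1, -5},  // ADD -5  -> 0
                    {2, 5},   // register == 0, so set PC = 5 (past the end: halt)
                    {1, 99},  // skipped by the jump
            };
            int pc = 0;       // program counter
            int register = 0;

            while (pc < memory.length) {
                int[] instr = memory[pc];  // 1. fetch the instruction at the PC
                pc++;                      // 2. increment the PC
                switch (instr[0]) {        // 3. execute, possibly rewriting the PC
                    case 0 -> register = instr[1];
                    case 1 -> register += instr[1];
                    case 2 -> { if (register == 0) pc = instr[1]; }
                }
            }
            System.out.println("register = " + register); // prints: register = 0
        }
    }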
The answer varies hugely across different systems and different implementations, but the most important parts are:
A thread has an independent thread of execution (i.e. you can context-switch away from it, and then back, and it will resume running where it was).
A thread has a lifetime (it can be created by another thread, and another thread can wait for it to finish).
It probably has less baggage attached than a "process".
Beyond that: threads could be implemented within a single process by a language runtime, threads could be coroutines, threads could be implemented within a single process by a threading library, or threads could be a kernel construct.
In several modern Unix systems, including Linux which I'm most familiar with, everything is threads -- a process is merely a type of thread that shares relatively few things with its parent (i.e. it gets its own memory mappings, its own file table and permissions, etc.) Reading man 2 clone, especially the list of flags, is really instructive here.
Unfortunately, threads do exist. A thread is something tangible. You can kill one, and the others will still be running. You can spawn new threads.... although each thread is not its own process, they are running separately inside the process. On multi-core machines, 2 threads could run at the same time.
http://en.wikipedia.org/wiki/Simultaneous_multithreading
http://www.intel.com/intelpress/samples/mcp_samplech01.pdf
Just as a process represents a virtual computer, the thread abstraction represents a virtual processor.
So threads are an abstraction.
Abstractions reduce complexity. Thus, the first question is what problem threads solve. The second question is how they can be implemented.
As to the first question: Threads make implementing multitasking easier. The main idea behind this is that multitasking is unnecessary if every task can be assigned to a unique worker. Actually, for the time being, it's fine to generalize the definition even further and say that the thread abstraction represents a virtual worker.
Now, imagine you have a robot that you want to give multiple tasks. Unfortunately, it can only execute a single, step by step task description. Well, if you want to make it multitask, you can try creating one big task description by interleaving the separate tasks you already have. This is a good start but the issue is that the robot sits at a desk and puts items on it while working. In order to get things right, you cannot just interleave instructions but also have to save and restore the items on the table.
This works, but now it's hard to disentangle the separate tasks by simply looking at the big task description that you created. Also, the ceremony of saving and restoring the items on the table is tedious and further clutters the task description.
Here is where the thread abstraction comes in and saves the day. It lets you assume that you have an infinite number of robots, each sitting in a different room at its own desk. Now, you can just throw task descriptions in a pot and everything else is taken care of by the thread abstraction's implementer. Remember? If there are enough workers, nobody has to multitask.
Often it is useful to indicate your perspective and say robot to mean real robots and virtual robot to mean the robots the thread abstraction provides you with.
At this point the problem of multitasking is solved for the case when the tasks are fully independent. However, wouldn't it be nice to let the robots go out of their rooms, interact and work together towards a common goal? Well, as you probably guessed, this requires coordination. Traffic lights, queues - you name it.
As an intermediate summary, the thread abstraction solves the problem of multitasking and creates an opportunity for cooperation. Without it, we only had a single robot, so cooperation was unthinkable. However, it has also brought the problem of coordination (synchronization) on us. Now we know what problem the thread abstraction solves and, as a bonus, we also know what new challenge it creates.
But wait, why do we care about multitasking in the first place?
First, multitasking can increase performance if the tasks involve waiting. For example, while the washing machine is running, you can easily start preparing dinner. And while your dinner is in the oven, you can hang out the clothes. Note that here you wait because an independent component does the job for you. Tasks that involve waiting are called I/O-bound tasks.
Second, if multitasking is done rapidly, and you look at it from a bird's-eye view, it appears as parallelism. It's a bit like how the human eye perceives a series of still images as motion if shown in quick succession. If I write a letter to Alice for one second and to Bob for one second as well, can you tell whether I wrote the two letters simultaneously or alternately, if you only look at what I'm doing every two seconds? Search for Multitasking Operating System for more on this.
Now, let's focus on the question of how the thread abstraction can be implemented.
Essentially, implementing the thread abstraction is about writing a task, a main task, that takes care of scheduling all the other tasks.
A fundamental question to ask is: If the scheduler schedules all tasks and the scheduler is also a task, then who schedules the scheduler?
Let's break this down. Say you write a scheduler, compile it and load it into the main memory of a computer at the address 1024, which happens to be the address that is loaded into the processor's instruction pointer when the computer is started. Now, your scheduler goes ahead and finds some tasks sitting precompiled in the main memory. For example, a task starts at the address 1,048,576. The scheduler wants to execute this task so it loads the task's address (1,048,576) into the instruction pointer. Huh, that was quite an ill-considered move, because now the scheduler has no way to regain control from the task it has just started.
One solution is to insert jump instructions to the scheduler (address 1024) into the task descriptions before execution. Actually, you shouldn't forget to save the items on the desk the robot is working at, so you also have to save the processor's registers before jumping. The issue here is that it is hard to tell where to insert the jump instructions. If there are too many, they create too much overhead and if there are too few of them, one task might monopolize the processor.
A second approach is to ask the task authors to designate a few places where control can be transferred back to the scheduler. Note that the authors don't have to write the logic for saving the registers and inserting the jump instruction because it suffices that they mark the appropriate places and the scheduler takes care of the rest. This looks like a good idea because task authors probably know that, for example, their task will wait for a while after loading and starting a washing machine, so they let the scheduler take control there.
The problem that neither of the above approaches solve is that of an erroneous or malicious task that, for example, gets caught up in an infinite loop and never jumps to the address where the scheduler lives.
Now, what to do if you cannot solve something in software? Solve it in hardware! What is needed is a programmable circuitry wired up to the processor that acts like an alarm clock. The scheduler sets a timer and its address (1024) and when the timer runs out, the alarm saves the registers and sets the instruction pointer to the address where the scheduler lives. This approach is called preemptive scheduling.
Probably by now you are starting to sense that implementing the thread abstraction is not like implementing a linked list. The most well-known implementers of the thread abstraction are operating systems. The threads they provide are sometimes called kernel-level threads. Since an operating system cannot afford to lose control, all major, general-purpose operating systems use preemptive scheduling.
Arguably, operating systems feel like the right place to implement the thread abstraction because they control all the hardware components and can suspend and resume threads very wisely. If a thread requests the contents of a file stored on a hard drive from the operating system, it immediately knows that this operation will most likely take a while and can let another task occupy the processor in the meanwhile. Then, it can pause the current task and resume the one that made the request, once the file's contents are available.
However, the story doesn't end here because threads can also be implemented in user space. These implementers are normally compilers. Interestingly, as far as I know, kernel-level threads are as powerful as threads can get. So why do we bother with user-level threads? The reason, of course, is performance. User-level threads are more lightweight so you can create more of them and normally the overhead of pausing and resuming them is small.
User-level threads can be implemented using async/await. Do you remember that one option to achieve that control gets back to the scheduler is to make task authors designate places where the transition can happen? Well, the async and await keywords serve exactly this purpose.
Now, if you've made it this far, be prepared because here comes the real fun!
Have you noticed that we barely talked about parallelism? I mean, don't we use threads to run related computations in parallel and thereby increase throughput? Well, not quite. Actually, if you only want parallelism, you don't need this machinery at all. You just create as many tasks as the number of processing units you have, and none of the tasks ever has to be paused or resumed. You don't even need a scheduler because you don't multitask.
In a sense, parallelism is an implementation detail. If you think about it, implementers of the thread abstraction can utilize as many processors as they wish under the hood. You can just compile some well-written multithreaded code from 1950, run it on a multicore today and see that it utilizes all cores. Importantly, the programmer who wrote that code probably didn't anticipate that piece of code being run on a multicore.
You could even argue that threads are abused when they are used to achieve parallelism: Even though people know they don't need the core feature, multitasking, they use threads to get access to parallelism.
As a final thought, note that user-level threads alone cannot provide parallelism. Remember the quote from the beginning? Operating systems run programs inside a virtual computer (process) that is normally equipped with a single virtual processor (thread) by default. No matter what magic you do in user space, if your virtual computer has only a single virtual processor, you cannot run code in parallel.
So what do we want? Of course, we want parallelism. But we also want lightweight threads. Therefore, many implementers of the thread abstraction started to use a hybrid approach: They start as many kernel-level threads as there are processing units in the hardware and run many user-level threads on top of a few kernel-level threads. Essentially, parallelism is taken care of by the kernel-level and multitasking by the user-level threads.
Now, an interesting design decision is what threading interface a language exposes. Go, for example, provides a single interface that allows users to create hybrid threads, so-called goroutines. There is no way to ask for, say, just a single kernel-level thread in Go. Other languages have separate interfaces for different kinds of threads. In Rust, kernel-level threads live in the standard library, while user-level and hybrid threads can be found in external libraries like async-std and tokio. In Python, the asyncio package provides user-level threads, while threading and multiprocessing provide kernel-level threads. Interestingly, the threads threading provides cannot run in parallel. On the other hand, the threads multiprocessing provides can run in parallel but, as the library's name suggests, each kernel-level thread lives in a different process (virtual machine). This makes multiprocessing unsuitable for certain tasks because transferring data between different virtual machines is often slow.
Further resources:
Operating Systems: Principles and Practice by Thomas and Anderson
Concurrency is not parallelism by Rob Pike
Parallelism and concurrency need different tools
Asynchronous Programming in Rust
Inside Rust's Async Transform
Rust's Journey to Async/Await
What Color is Your Function?
Why goroutines instead of threads?
Why doesn't my program run faster with more CPUs?
John Reese - Thinking Outside the GIL with AsyncIO and Multiprocessing - PyCon 2018

Number of processor core vs the size of a thread pool

Many times I've heard that it is better to maintain the number of threads in a thread pool below the number of cores in that system. Having twice or more threads than the number of cores is not only a waste, but also could cause performance degradation.
Are those true? If not, what are the fundamental principles that debunk those claims (specifically relating to Java)?
Many times I've heard that it is better to maintain the number of threads in a thread pool below the number of cores in that system. Having twice or more threads than the number of cores is not only a waste, but also could cause performance degradation.
The claims are not true as a general statement. That is to say, sometimes they are true (or true-ish) and other times they are patently false.
A couple things are indisputably true:
More threads means more memory usage. Each thread requires a thread stack. For recent HotSpot JVMs, the minimum thread stack size is 64KB, and the default can be as much as 1MB. That can be significant. In addition, any thread that is alive is likely to own or share objects in the heap whether or not it is currently runnable. Therefore it is reasonable to expect that more threads means a larger memory working set.
A JVM cannot have more threads actually running than there are cores (or hyperthread cores or whatever) on the execution hardware. A car won't run without an engine, and a thread won't run without a core.
Beyond that, things get less clear-cut. The "problem" is that a live thread can be in a variety of "states". For instance:
A live thread can be running; i.e. actively executing instructions.
A live thread can be runnable; i.e. waiting for a core so that it can be run.
A live thread can be synchronizing; i.e. waiting for a signal from another thread, or waiting for a lock to be released.
A live thread can be waiting on an external event; e.g. waiting for some external server / service to respond to a request.
The "one thread per core" heuristic assumes that threads are either running or runnable (according to the above). But for a lot of multi-threaded applications, the heuristic is wrong ... because it doesn't take account of threads in the other states.
Now "too many" threads clearly can cause significant performance degradation, simple by using too much memory. (Imagine that you have 4Gb of physical memory and you create 8,000 threads with 1Mb stacks. That is a recipe for virtual memory thrashing.)
But what about other things? Can having too many threads cause excessive context switching?
I don't think so. If you have lots of threads, your application's use of those threads can result in excessive context switches, and that is bad for performance. However, I posit that the root cause of the context switching is not the actual number of threads. The root of the performance problems is more likely that the application is:
synchronizing in a particularly wasteful way; e.g. using Object.notifyAll() when Object.notify() would be better, OR
synchronizing on a highly contended data structure, OR
doing too much synchronization relative to the amount of useful work that each thread is doing, OR
trying to do too much I/O in parallel.
(In the last case, the bottleneck is likely to be the I/O system rather than context switches ... unless the I/O is IPC with services / programs on the same machine.)
The other point is that in the absence of the confounding factors above, having more threads is not going to increase context switches. If your application has N runnable threads competing for M processors, and the threads are purely computational and contention-free, then the OS's thread scheduler is going to attempt to time-slice between them. But the length of a timeslice is likely to be measured in tenths of a second (or more), so that the context switch overhead is negligible compared with the work that a CPU-bound thread actually performs during its slice. And if we assume that the length of a time slice is constant, then the context switch overhead will be constant too. Adding more runnable threads (increasing N) won't change the ratio of work to overhead significantly.
In summary, it is true that "too many threads" is harmful for performance. However, there is no reliable universal "rule of thumb" for how many is "too many". And (fortunately) you generally have considerable leeway before the performance problems of "too many" become significant.
Having fewer threads than cores generally means you can't take advantage of all available cores.
The usual question is how many more threads than cores you want. That, however, varies, depending on the amount of time (overall) that your threads spend doing things like I/O vs. the amount of time they spend doing computation. If they're all doing pure computation, then you'd normally want about the same number of threads as cores. If they're doing a fair amount of I/O, you'd typically want quite a few more threads than cores.
Looking at it from the other direction for a moment, you want enough threads running to ensure that whenever one thread blocks for some reason (typically waiting on I/O) you have another thread (that's not blocked) available to run on that core. The exact number that takes depends on how much of its time each thread spends blocked.
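Goetz's Java Concurrency in Practice (quoted earlier on this page) gives a well-known heuristic for exactly this: N_threads = N_cpu * U_cpu * (1 + wait time / compute time). A hedged sketch, where the target utilization and the wait/compute ratio are assumptions you would measure for your own workload:

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class PoolSizing {
        public static void main(String[] args) {
            int nCpu = Runtime.getRuntime().availableProcessors();
            double targetUtilization = 1.0; // fraction of the CPU the pool may use
            double waitComputeRatio = 9.0;  // W/C: assumed here, measured in practice

            // Java Concurrency in Practice heuristic:
            // N_threads = N_cpu * U_cpu * (1 + W/C)
            int poolSize = (int) Math.max(1,
                    nCpu * targetUtilization * (1 + waitComputeRatio));

            ExecutorService pool = Executors.newFixedThreadPool(poolSize);
            System.out.println("cores=" + nCpu + ", poolSize=" + poolSize);
            pool.shutdown();
        }
    }

With the assumed 9:1 wait-to-compute ratio, an 8-core machine gets an 80-thread pool; for purely computational tasks (W/C near 0) the same formula collapses to roughly one thread per core.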
That's not true, unless the number of threads is vastly more than the number of cores. The reasoning is that additional threads will mean additional context switches. But it's not true because an operating system will only make unforced context switches if those context switches are beneficial, and additional threads don't force additional context switches.
If you create an absurd number of threads, that wastes resources. But none of this is anything compared to how bad creating too few threads is. If you create too few threads, an unexpected block (such as a page fault) can result in CPUs sitting idle, and that swamps any possible harm from a few extra context switches.
Not exactly true; this depends on the overall software architecture. There's a reason to keep more threads than available cores: some of the threads may be suspended by the OS while they wait for an I/O to complete. This may be an explicit I/O invocation (such as a synchronous read from a file), or implicit, such as system paging.
Actually, I've read in one book that keeping the number of threads at twice the number of CPU cores is good practice.
For REST API calls or say I/O-bound operations, having more threads than the number of cores can potentially improve the performance by allowing multiple API requests to be processed in parallel. However, the optimal number of threads depends on various factors such as the API request frequency, the complexity of the request processing, and the resources available on the server.
If the API request processing is CPU-bound and requires a lot of computation, having too many threads may cause resource contention and lead to reduced performance. In such cases, the number of threads should be limited to the number of cores available.
On the other hand, if the API request processing is I/O-bound and involves a lot of waiting for responses from external resources such as databases, having more threads may improve performance by allowing multiple requests to be processed in parallel.
In any case, it is recommended to perform performance testing to determine the optimal number of threads for your specific use case and monitor the system performance using metrics such as response time, resource utilization, and error rate.

In terms of performance, do the threads (such as swing threads) "interfere" with my threads?

I'm developing a small multi-threaded application (in Java) to help me understand it. As I researched, I learned that the ideal number of threads is the number supported by the processor (i.e. 4 on an Intel i3, 8 on an Intel i7, I think). But Swing alone already has 3 threads, plus 1 thread (the main one, in this case). Does that mean that I won't have any significant improvement on a processor which supports 4 threads? Will the Swing threads just consume all the processor threads and everything else will just run on the same processor? Is it worth it to multi-thread (performance-wise) even with those Swing threads?
Note: a maybe important observation is that I will be using a JFrame and doing active rendering. That's probably as far as I will go with Swing.
I learned that the ideal number of threads is the number supported by the processor
That statement is only true if your Threads are occupying the whole CPU. For example the Swing thread (Event Dispatch Thread) is most of the time just waiting for user input.
The number one thing that most threads do is waiting. They are just there so they are ready to go at the instant the system needs their services.
The comment about the ideal thread count applies to threads running at 100% workload.
Swing's threads spend a lot of time being idle. The ideal number of threads refers to threads executing at or near 100% processor time.
You may still not see significant improvements due to other factors, but the threads inherent in swing shouldn't be a concern.
The ideal number of threads is not necessarily governed by the number of CPUs (cores) you have. There's a lot of tuning involved based on the actual code you are executing.
For example, let's take a Runnable which executes some database queries (it doesn't matter what). Most of the thread's time will be spent blocked, waiting for a response from the database. So if you have 4 cores and execute 4 threads, the odds are that at any given time many of them are blocked on DB calls. You can easily spawn more threads with no ill effects on your CPU. In this case you're limited not by the specs of your machine, but by the degree of concurrency which the DB will handle.
Another example would be file I/O, which spends most of its time waiting for the I/O subsystem to respond with data.
The only real way to know is evaluation of your multi-threaded code, along with trial and error, for a given environment.
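A crude harness for that trial-and-error approach might look like the following (a sketch only: the sleep() stands in for a blocking DB or file call, and real measurements should use your actual workload):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    // Times a fixed batch of IO-like tasks under several pool sizes.
    public class PoolSizeExperiment {
        public static void main(String[] args) throws Exception {
            for (int size : new int[] {2, 8, 32, 128}) {
                ExecutorService pool = Executors.newFixedThreadPool(size);
                long start = System.nanoTime();
                List<Future<?>> futures = new ArrayList<>();
                for (int i = 0; i < 256; i++) {
                    futures.add(pool.submit(() -> {
                        Thread.sleep(50); // stand-in for a blocking call
                        return null;
                    }));
                }
                for (Future<?> f : futures) f.get(); // wait for the whole batch
                pool.shutdown();
                System.out.printf("pool=%3d -> %d ms%n",
                        size, (System.nanoTime() - start) / 1_000_000);
            }
        }
    }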
Yes, they do but only minimally. There are other threads such as GC and finalizer threads as well that are running in the background. These are all necessary for the JVM to operate just as the Swing threads are necessary for Swing to work.
You shouldn't have to worry about them unless you are on a dramatically small system with little resources or CPU capacity.
With modern systems, many of which have multiple processors and/or multiple cores, the JVM and the OS will run these other threads on other processors and still give your user threads all of the processor power that you will need.
Also, most of the background Swing threads are in wait loops waiting to handle events and make display changes. Unless you do something wrong, they should make up a small amount of your application processor requirements.
I learned that the ideal number of threads is the number supported by the processor
As #Robin mentioned, this is only necessary when you are trying to optimize a program that has a number of CPU-bound operations. For example, our application typically has thousands of threads but 8 processors and is still very responsive, since the threads are all waiting for IO or events. The only time you need to worry about the number of CPUs is when you are doing processor-intensive operations and you are trying to maximize your throughput.
