Is it safe to run more than 1K threads in java?

Is it safe to run more than 1K threads in java? - java

I am making a program which tries to get all the possible outcomes. The threads in the program generate more threads (a little over a thousand). I am really bad at multithreading, I fear that the generation of threads won't stop. I am using the eclipse IDE which has a terminate button, will this stop all the running thread, if not is there any other way? Can the JVM handle this?

Yes. Clicking the terminate button in an eclipse launched JVM will halt that JVM, and that will stop all of the running threads (just like killing the JVM process).
As for running one thousand threads, I wouldn't recommend it... that sounds like a really slow approach (since each thread can only run a maximum of ~n/1000th of the time on n CPU cores).

Actually, there is no restrictions to open 1000 or more thread in Java programming language. However, the problem is it will slow down the works.
Do you know how the thread works? Thread just creates an environment with the help of the OS so that the application user can feel that the programs are running parallel. But, the actual scenario is different. The computer having a single core processor can handle one operation at an instant. Our OS just sends operations one after one, so that we the user can feel that they are running in parallel.
For example, let us consider a three threaded application. Each of its thread has a for loop. First thread adds numbers inside the loop and keep the result in a variable named result1, second thread multiplies numbers inside the loop and keep the result in a variable named result2, and third thread subtracts numbers inside the loop and keep the result in a variable named result3.
Now, if all these threads are started at the same instant and let all has same priority then the OS will send instructions one after one. If can send a number to add with result1. In the next instant, it can send a number to multiply with result2. In the next instant, it can send a number to subtract with result3. In the next instant, it can again add.
That means, actually a single core processor cannot compute three computation simultaneously. It compute one computation and make pause the remaining ones and go through in this way.
I think, now you understand why running 1000 thread will slow down the whole process. If the performance is not issue in the mentioned task and you just need the output you can run 1000+ thread.
But, if you need an improved performance you have to think something else. Have you hard about Map Reduce? Implementing map reduce by Hadoop framework you can get better performance in these type of issues. However, first you have to design your problem in map reduce frame. And this framework will compute you task in parallel with the help of more than one computers.
Another solution can be setting priority. In java you can set priorities in a thread.You can give critical tasks high priorities than simpler tasks. If your problem's tasks can be distinguished by high and low priorities tasks it will make the performance better definitely.

Related

How come my Java program goes faster when I execute a number of parallel threads superior to Runtime.getRuntime().availableProcessors())?

I have done some research on how to execute parallel threads in Java.
I have found an easy solution, which is to use ExecutorService class.
It is basically used by calling the following:
ExecutorService es = Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
Each thread executes a simple task, such as a System.out.println().
I have been told that Runtime.getRuntime().availableProcessors() returns the number of processors, which is basically the number of execution engines capable of running your code, physically distinct or logical processors, if using hyper-threading.
However, when I use the following line instead:
ExecutorService es = Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors()*2);
my program actually goes a lot faster (even though I haven't calculated the exact running time, simply observing the obvious change in speed).
How is this possible? Also, if I multiply the number by three, it speeds up even more, even though the speed stops increasing with higher factors.
My computer is a MacBook Pro running on Yosemite, with a 2,2 GHz Intel Core i7 processor.

The simple task (calling System.out.println()) involves interacting with the outside world, and it may be blocked waiting for (say) the display device or a disk to respond.
A task could also be synchronizing with other tasks, and could have to wait for activity on another thread to do something, or wait on updates to shared variables.
The "number of threads ~= number of cores" rule is just a rule of thumb. It is only predictive if the threads are truly independent of each other and external influences. For a real world multi-threaded application, you need to tune the number of threads for the application, the platform and the problem if your aim is to maximize performance.

Running with more worker threads than there are available processors is like having more than six players suited up for a hockey game. If one player needs to leave the ice, there's another ready to leave the bench and take his place.
That's why I asked if any of your threads do I/O. Doing I/O will block a thread until the I/O operation completes. If you have more workers than there are available processors, then when one thread drops out to wait for I/O, another will be ready to take its place and continue using the CPU. You'll get better utilization of the available cycles that way.
Don't forget that paging is I/O too.
I also asked whether there were any other processes running on the machine. Desktop operating systems schedule threads according to some notion of "fairness". I don't know what "fairness" means on Mac OS, but if it means that the OS tries to give each thread a fair share of CPU time (as opposed to giving each process a fair share), then a program that has more threads will get a larger share than a program that has fewer threads. This will only matter if there are other programs that are actually using CPU.

ExecutorService es = Executors.newFixedThreadPool(noOfThreads);
creates a fixed pool of threads with the input number. More threads gives better performance in some scenarios depending the task performed by the threads.
In the following statements
ExecutorService es = Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
ExecutorService es = Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors()*2);
You are just using the number of cpu as a multiple but you are not actually explicitly involving the available cpus. Its only the thread number which is making the difference.

What is a Thread? How many threads does my program have? [duplicate]

I have been trying to find a good definition, and get an understanding, of what a thread really is.
It seems that I must be missing something obvious, but every time I read about what a thread is, it's almost a circular definition, a la "a thread is a thread of execution" or " a way to divide into running tasks". Uh uh. Huh?
It seems from what I have read that a thread is not really something concrete, like a process is. It is in fact just a concept. From what I understand of the way this works, a processor executes some commands for a program (which has been termed a thread of execution), then when it needs to switch to processing for some other program for a bit, it stores the state of the program it's currently executing for somewhere (Thread Local Storage) and then starts executing the other program's instructions. And back and forth. Such that, a thread is really just a concept for "one of the paths of execution" of a program that is currently running.
Unlike a process, which really is something - it is a conglomeration of resources, etc.
As an example of a definition that didn't really help me much . . .
From Wikipedia:
"A thread in computer science is short for a thread of execution. Threads are a way for a program to divide (termed "split") itself into two or more simultaneously (or pseudo-simultaneously) running tasks. Threads and processes differ from one operating system to another but, in general, a thread is contained inside a process and different threads in the same process share same resources while different processes in the same multitasking operating system do not."
So am I right? Wrong? What is a thread really?
Edit: Apparently a thread is also given its own call stack, so that is somewhat of a concrete thing.

A thread is an execution context, which is all the information a CPU needs to execute a stream of instructions.
Suppose you're reading a book, and you want to take a break right now, but you want to be able to come back and resume reading from the exact point where you stopped. One way to achieve that is by jotting down the page number, line number, and word number. So your execution context for reading a book is these 3 numbers.
If you have a roommate, and she's using the same technique, she can take the book while you're not using it, and resume reading from where she stopped. Then you can take it back, and resume it from where you were.
Threads work in the same way. A CPU is giving you the illusion that it's doing multiple computations at the same time. It does that by spending a bit of time on each computation. It can do that because it has an execution context for each computation. Just like you can share a book with your friend, many tasks can share a CPU.
On a more technical level, an execution context (therefore a thread) consists of the values of the CPU's registers.
Last: threads are different from processes. A thread is a context of execution, while a process is a bunch of resources associated with a computation. A process can have one or many threads.
Clarification: the resources associated with a process include memory pages (all the threads in a process have the same view of the memory), file descriptors (e.g., open sockets), and security credentials (e.g., the ID of the user who started the process).

A thread is an independent set of values for the processor registers (for a single core). Since this includes the Instruction Pointer (aka Program Counter), it controls what executes in what order. It also includes the Stack Pointer, which had better point to a unique area of memory for each thread or else they will interfere with each other.
Threads are the software unit affected by control flow (function call, loop, goto), because those instructions operate on the Instruction Pointer, and that belongs to a particular thread. Threads are often scheduled according to some prioritization scheme (although it's possible to design a system with one thread per processor core, in which case every thread is always running and no scheduling is needed).
In fact the value of the Instruction Pointer and the instruction stored at that location is sufficient to determine a new value for the Instruction Pointer. For most instructions, this simply advances the IP by the size of the instruction, but control flow instructions change the IP in other, predictable ways. The sequence of values the IP takes on forms a path of execution weaving through the program code, giving rise to the name "thread".

In order to define a thread formally, we must first understand the boundaries of where a thread operates.
A computer program becomes a process when it is loaded from some store into the computer's memory and begins execution. A process can be executed by a processor or a set of processors. A process description in memory contains vital information such as the program counter which keeps track of the current position in the program (i.e. which instruction is currently being executed), registers, variable stores, file handles, signals, and so forth.
A thread is a sequence of such instructions within a program that can be executed independently of other code. The figure shows the concept:
Threads are within the same process address space, thus, much of the information present in the memory description of the process can be shared across threads.
Some information cannot be replicated, such as the stack (stack pointer to a different memory area per thread), registers and thread-specific data. This information suffices to allow threads to be scheduled independently of the program's main thread and possibly one or more other threads within the program.
Explicit operating system support is required to run multithreaded programs. Fortunately, most modern operating systems support threads such as Linux (via NPTL), BSD variants, Mac OS X, Windows, Solaris, AIX, HP-UX, etc. Operating systems may use different mechanisms to implement multithreading support.
Here, you can find more information about the topic. That was also my information-source.
Let me just add a sentence coming from Introduction to Embedded System by Edward Lee and Seshia:
Threads are imperative programs that run concurrently and share a memory space. They can access each others’ variables. Many practitioners in the field use the term “threads” more narrowly to refer to particular ways of constructing programs that share memory, [others] to broadly refer to any mechanism where imperative programs run concurrently and share memory. In this broad sense, threads exist in the form of interrupts on almost all microprocessors, even without any operating system at all (bare iron).

Processes are like two people using two different computers, who use the network to share data when necessary. Threads are like two people using the same computer, who don't have to share data explicitly but must carefully take turns.
Conceptually, threads are just multiple worker bees buzzing around in the same address space. Each thread has its own stack, its own program counter, etc., but all threads in a process share the same memory. Imagine two programs running at the same time, but they both can access the same objects.
Contrast this with processes. Processes each have their own address space, meaning a pointer in one process cannot be used to refer to an object in another (unless you use shared memory).
I guess the key things to understand are:
Both processes and threads can "run at the same time".
Processes do not share memory (by default), but threads share all of their memory with other threads in the same process.
Each thread in a process has its own stack and its own instruction pointer.

I am going to use a lot of text from the book Operating Systems Concepts by ABRAHAM SILBERSCHATZ, PETER BAER GALVIN and GREG GAGNE along with my own understanding of things.
Process
Any application resides in the computer in the form of text (or code).
We emphasize that a program by itself is not a process. A program is a
passive entity, such as a file containing a list of instructions stored on disk
(often called an executable file).
When we start an application, we create an instance of execution. This instance of execution is called a process.
EDIT:(As per my interpretation, analogous to a class and an instance of a class, the instance of a class being a process. )
An example of processes is that of Google Chrome.
When we start Google Chrome, 3 processes are spawned:
• The browser process is responsible for managing the user interface as
well as disk and network I/O. A new browser process is created when
Chrome is started. Only one browser process is created.
• Renderer processes contain logic for rendering web pages. Thus, they
contain the logic for handling HTML, Javascript, images, and so forth.
As a general rule, a new renderer process is created for each website
opened in a new tab, and so several renderer processes may be active
at the same time.
• A plug-in process is created for each type of plug-in (such as Flash
or QuickTime) in use. Plug-in processes contain the code for the
plug-in as well as additional code that enables the plug-in to
communicate with associated renderer processes and the browser
process.
Thread
To answer this I think you should first know what a processor is. A Processor is the piece of hardware that actually performs the computations.
EDIT: (Computations like adding two numbers, sorting an array, basically executing the code that has been written)
Now moving on to the definition of a thread.
A thread is a basic unit of CPU utilization; it comprises a thread ID, a program
counter, a register set, and a stack.
EDIT: Definition of a thread from intel's website:
A Thread, or thread of execution, is a software term for the basic ordered sequence of instructions that can be passed through or processed by a single CPU core.
So, if the Renderer process from the Chrome application sorts an array of numbers, the sorting will take place on a thread/thread of execution. (The grammar regarding threads seems confusing to me)
My Interpretation of Things
A process is an execution instance. Threads are the actual workers that perform the computations via CPU access. When there are multiple threads running for a process, the process provides common memory.
EDIT:
Other Information that I found useful to give more context
All modern day computer have more than one threads. The number of threads in a computer depends on the number of cores in a computer.
Concurrent Computing:
From Wikipedia:
Concurrent computing is a form of computing in which several computations are executed during overlapping time periods—concurrently—instead of sequentially (one completing before the next starts). This is a property of a system—this may be an individual program, a computer, or a network—and there is a separate execution point or "thread of control" for each computation ("process").
So, I could write a program which calculates the sum of 4 numbers:
(1 + 3) + (4 + 5)
In the program to compute this sum (which will be one process running on a thread of execution) I can fork another process which can run on a different thread to compute (4 + 5) and return the result to the original process, while the original process calculates the sum of (1 + 3).

This was taken from a Yahoo Answer:
A thread is a coding construct
unaffect by the architecture of an
application. A single process
frequently may contain multiple
threads. Threads can also directly
communicate with each other since they
share the same variables.
Processes are independent execution
units with their own state
information. They also use their own
address spaces and can only interact
with other processes through
interprocess communication mechanisms.
However, to put in simpler terms threads are like different "tasks". So think of when you are doing something, for instance you are writing down a formula on one paper. That can be considered one thread. Then another thread is you writing something else on another piece of paper. That is where multitasking comes in.
Intel processors are said to have "hyper-threading" (AMD has it too) and it is meant to be able to perform multiple "threads" or multitask much better.
I am not sure about the logistics of how a thread is handled. I do recall hearing about the processor going back and forth between them, but I am not 100% sure about this and hopefully somebody else can answer that.

A thread is nothing more than a memory context (or how Tanenbaum better puts it, resource grouping) with execution rules. It's a software construct. The CPU has no idea what a thread is (some exceptions here, some processors have hardware threads), it just executes instructions.
The kernel introduces the thread and process concept to manage the memory and instructions order in a meaningful way.

A thread is a set of (CPU)instructions which can be executed.
But in order to have a better understanding of what a thread is, some computer architecture knowledge is required.
What a computer does, is to follow instructions and manipulate data.
RAM is the place where the instructions and data are saved, the processor uses those instructions to perform operations on the saved data.
The CPU has some internal memory cells called, registers. It can perform simple mathematical operations with numbers stored in these registers. It can also move data between the RAM and these registers. These are examples of typical operations a CPU can be instructed to execute:
Copy data from memory position #220 into register #3
Add the number in register #3 to the number in register #1.
The collection of all operations a CPU can do is called instruction set. Each operation in the instruction set is assigned a number. Computer code is essentially a sequence of numbers representing CPU operations. These operations are stored as numbers in the RAM. We store input/output data, partial calculations, and computer code, all mixed together in the RAM.
The CPU works in a never-ending loop, always fetching and executing an instruction from memory. At the core of this cycle is the PC register, or Program Counter. It's a special register that stores the memory address of the next instruction to be executed.
The CPU will:
Fetch the instruction at the memory address given by the PC,
Increment the PC by 1,
Execute the instruction,
Go back to step 1.
The CPU can be instructed to write a new value to the PC, causing the execution to branch, or "jump" to somewhere else in the memory. And this branching can be conditional. For instance, a CPU instruction could say: "set PC to address #200 if register #1 equals zero". This allows computers to execute stuff like this:
if x = 0
compute_this()
else
compute_that()
Resources used from Computer Science Distilled.

The answer varies hugely across different systems and different implementations, but the most important parts are:
A thread has an independent thread of execution (i.e. you can context-switch away from it, and then back, and it will resume running where it was).
A thread has a lifetime (it can be created by another thread, and another thread can wait for it to finish).
It probably has less baggage attached than a "process".
Beyond that: threads could be implemented within a single process by a language runtime, threads could be coroutines, threads could be implemented within a single process by a threading library, or threads could be a kernel construct.
In several modern Unix systems, including Linux which I'm most familiar with, everything is threads -- a process is merely a type of thread that shares relatively few things with its parent (i.e. it gets its own memory mappings, its own file table and permissions, etc.) Reading man 2 clone, especially the list of flags, is really instructive here.

Unfortunately, threads do exist. A thread is something tangible. You can kill one, and the others will still be running. You can spawn new threads.... although each thread is not its own process, they are running separately inside the process. On multi-core machines, 2 threads could run at the same time.
http://en.wikipedia.org/wiki/Simultaneous_multithreading
http://www.intel.com/intelpress/samples/mcp_samplech01.pdf

Just as a process represents a virtual computer, the thread
abstraction represents a virtual processor.
So threads are an abstraction.
Abstractions reduce complexity. Thus, the first question is what problem threads solve. The second question is how they can be implemented.
As to the first question: Threads make implementing multitasking easier. The main idea behind this is that multitasking is unnecessary if every task can be assigned to a unique worker. Actually, for the time being, it's fine to generalize the definition even further and say that the thread abstraction represents a virtual worker.
Now, imagine you have a robot that you want to give multiple tasks. Unfortunately, it can only execute a single, step by step task description. Well, if you want to make it multitask, you can try creating one big task description by interleaving the separate tasks you already have. This is a good start but the issue is that the robot sits at a desk and puts items on it while working. In order to get things right, you cannot just interleave instructions but also have to save and restore the items on the table.
This works, but now it's hard to disentangle the separate tasks by simply looking at the big task description that you created. Also, the ceremony of saving and restoring the items on the tabe is tedious and further clutters the task description.
Here is where the thread abstraction comes in and saves the day. It lets you assume that you have an infinite number of robots, each sitting in a different room at its own desk. Now, you can just throw task descriptions in a pot and everything else is taken care of by the thread abstraction's implementer. Remember? If there are enough workers, nobody has to multitask.
Often it is useful to indicate your perspective and say robot to mean real robots and virtual robot to mean the robots the thread abstraction provides you with.
At this point the problem of multitasking is solved for the case when the tasks are fully independent. However, wouldn't it be nice to let the robots go out of their rooms, interact and work together towards a common goal? Well, as you probably guessed, this requires coordination. Traffic lights, queues - you name it.
As an intermediate summary, the thread abstraction solves the problem of multitasking and creates an opportunity for cooperation. Without it, we only had a single robot, so cooperation was unthinkable. However, it has also brought the problem of coordination (synchronization) on us. Now we know what problem the tread abstraction solves and, as a bonus, we also know what new challenge it creates.
But wait, why do we care about multitasking in the first place?
First, multitasking can increase performance if the tasks involve waiting. For example, while the washing machine is running, you can easily start preparing dinner. And while your dinner is in the over, you can hang out the clothes. Note that here you wait because an independent component does the job for you. Tasks that involve waiting are called I/O bound tasks.
Second, if multitasking is done rapidly, and you look at it from a bird's eyes view, it appears as parallelism. It's a bit like how the human eye perceives a series of still images as motion if shown in quick succession. If I write a letter to Alice for one second and to Bob for one second as well, can you tell if I wrote the two letters simultaneously or alternately, if you only look at what I'm doing every two seconds? Search for Multitasking Operating System for more on this.
Now, let's focus on the question of how the thread abstraction can be implemented.
Essentially, implementing the thread abstraction is about writing a task, a main task, that takes care of scheduling all the other tasks.
A fundamental question to ask is: If the scheduler schedules all tasks and the scheduler is also a task, then who schedules the scheduler?
Let's brake this down. Say you write a scheduler, compile it and load it into the main memory of a computer at the address 1024, which happens to be the address that is loaded into the processor's instruction pointer when the computer is started. Now, your scheduler goes ahead and finds some tasks sitting precompiled in the main memory. For example, a task starts at the address 1,048,576. The scheduler wants to execute this task so it loads the task's address (1,048,576) into the instruction pointer. Huh, that was quite an ill considered move because now the scheduler has no way to regain control from the task it has just started.
One solution is to insert jump instructions to the scheduler (address 1024) into the task descriptions before execution. Actually, you shouldn't forget to save the items on the desk the robot is working at, so you also have to save the processor's registers before jumping. The issue here is that it is hard to tell where to insert the jump instructions. If there are too many, they create too much overhead and if there are too few of them, one task might monopolize the processor.
A second approach is to ask the task authors to designate a few places where control can be transferred back to the scheduler. Note that the authors don't have to write the logic for saving the registers and inserting the jump instruction because it suffices that they mark the appropriate places and the scheduler takes care of the rest. This looks like a good idea because task authors probably know that, for example, their task will wait for a while after loading and starting a washing machine, so they let the scheduler take control there.
The problem that neither of the above approaches solve is that of an erroneous or malicious task that, for example, gets caught up in an infinite loop and never jumps to the address where the scheduler lives.
Now, what to do if you cannot solve something in software? Solve it in hardware! What is needed is a programmable circuitry wired up to the processor that acts like an alarm clock. The scheduler sets a timer and its address (1024) and when the timer runs out, the alarm saves the registers and sets the instruction pointer to the address where the scheduler lives. This approach is called preemptive scheduling.
Probably by now you start to sense that implementing the thread abstraction is not like implementing a linked list. The most well-known implementers of the thread abstraction are operating systems. The threads they provide are sometimes called kernel-level threads. Since an operating system cannot afford losing control, all major, general-purpose operating systems uses preemptive scheduling.
Arguably, operating systems feel like the right place to implement the thread abstraction because they control all the hardware components and can suspend and resume threads very wisely. If a thread requests the contents of a file stored on a hard drive from the operating system, it immediately knows that this operation will most likely take a while and can let another task occupy the processor in the meanwhile. Then, it can pause the current task and resume the one that made the request, once the file's contents are available.
However, the story doesn't end here because threads can also be implemented in user space. These implementers are normally compilers. Interestingly, as far as I know, kernel-level threads are as powerful as threads can get. So why do we bother with user-level threads? The reason, of course, is performance. User-level threads are more lightweight so you can create more of them and normally the overhead of pausing and resuming them is small.
User-level threads can be implemented using async/await. Do you remember that one option to achieve that control gets back to the scheduler is to make task authors designate places where the transition can happen? Well, the async and await keywords serve exactly this purpose.
Now, if you've made it this far, be prepared because here comes the real fun!
Have you noticed that we barely talked about parallelism? I mean, don't we use threads to run related computations in parallel and thereby increase throughput? Well, not quiet.. Actually, if you only want parallelism, you don't need this machinery at all. You just create as many tasks as the number of processing units you have and none of the tasks has to be paused or resumed ever. You don't even need a scheduler because you don't multitask.
In a sense, parallelism is an implementation detail. If you think about it, implementers of the thread abstraction can utilize as many processors as they wish under the hood. You can just compile some well-written multithreaded code from 1950, run it on a multicore today and see that it utilizes all cores. Importantly, the programmer who wrote that code probably didn't anticipate that piece of code being run on a multicore.
You could even argue that threads are abused when they are used to achieve parallelism: Even though people know they don't need the core feature, multitasking, they use threads to get access to parallelism.
As a final thought, note that user-level threads alone cannot provide parallelism. Remember the quote from the beginning? Operating systems run programs inside a virtual computer (process) that is normally equipped with a single virtual processor (thread) by default. No matter what magic you do in user space, if your virtual computer has only a single virtual processor, you cannot run code in parallel.
So what do we want? Of course, we want parallelism. But we also want lightweight threads. Therefore, many implementers of the thread abstraction started to use a hybrid approach: They start as many kernel-level threads as there are processing units in the hardware and run many user-level threads on top of a few kernel-level threads. Essentially, parallelism is taken care of by the kernel-level and multitasking by the user-level threads.
Now, an interesting design decision is what threading interface a language exposes. Go, for example, provides a single interface that allows users to create hybrid threads, so called goroutines. There is no way to ask for, say, just a single kernel-level thread in Go. Other languages have separate interfaces for different kinds of threads. In Rust, kernel-level threads live in the standard library, while user-level and hybrid threads can be found in external libraries like async-std and tokio. In Python, the asyncio package provides user-level threads while multithreading and multiprocessing provide kernel-level threads. Interestingly, the threads multithreading provides cannot run in parallel. On the other hand, the threads multiprocessing provides can run in parallel but, as the library's name suggests, each kernel-level thread lives in a different process (virtual machine). This makes multiprocessing unsuitable for certain tasks because transferring data between different virtual machines is often slow.
Further resources:
Operating Systems: Principles and Practice by Thomas and Anderson
Concurrency is not parallelism by Rob Pike
Parallelism and concurrency need different tools
Asynchronous Programming in Rust
Inside Rust's Async Transform
Rust's Journey to Async/Await
What Color is Your Function?
Why goroutines instead of threads?
Why doesn't my program run faster with more CPUs?
John Reese - Thinking Outside the GIL with AsyncIO and Multiprocessing - PyCon 2018

Java threads are not actually executed in parallel?

Until now I was under the impression that 2 threads that start in the same time are also executed in parallel (both running their piece of codes in the same time), but I read some documentation recently and I understood that they actually take turns on the execution of their code, so there is no piece of code for first thread executed in the same time as a piece of code from the second thread.
Is my understanding correct?
If yes, then how multi-threading is faster then one thread execution?
I'm asking this because the only difference is that a single thread executes the code sequential, while multithreading can take turns on the execution, but still should take the same amount of time since it's nothing done in parallel

a) on multi-processor machines, threads can actually run in parallel (one per CPU)
b) If your thread calls Thread.sleep() while waiting for IO etc., it makes resources available to other threads. So multi-threaded applications are actually faster than single-threaded ones when dealing with external resources

Java threads are executed in parallel if there are enough CPUs available for a JVM. You can't run 2 computations on a machine with a single computing element at the same time, so this computing element is used either by first, or by second computation at any given time. Probably what you've read concerned this circumstances.

No, Java threads are executed in parallel (unlike some other platforms like CPython). However, whether that gives performance improvements depends on the code you execute.
If you test with easily parallelizable & CPU intensive tasks like calculating PI with a parallelizable algorithm or resizing lots of images etc., you can easily demonstrate that performance can be increased basically linearly (if you have 2 CPUs = x2, 4 CPUs = x4 etc.)
EDIT:
When you only have one CPU, multi-threading is still beneficial. For example, you can have one thread reading images from the disk while the other thread resizes the images. This will also improve the performance because you can utilize the CPU without waste.
EDIT2:
When you read and resize images (note the plural) in a single thread, then you will see that CPU usage won't be 100% at all times. This is because while the thread is reading from file, it can't perform the resizing. If you had more than one thread, by the time a resize has finished another file would have been ready in-memory. If you are dealing with big images, it's relatively easy to peg the CPU at 100% with this design.

Well the answer of you question depends on the number of CPU a system has .
Keep in mind that a single CPU can process only one thread at a time but the context switching between the threads is so fast that it seems that the threads are running concurrently.
On your second question If yes, then how multi-threading is faster then one thread execution?
Mutlithreading utilizes the CPU cycles . Say if one thread is blocked on some resource , other threads might get a chance to run .
On a side note , go through this blog page if you want to see some basic multithreading tutorials http://javasolutionsonline.blogspot.in/p/java-concurrency.html

umm..threads do run in parallel...but not in your conventional pcs that had single cores..
if you have a multi core chip or many CPUs , then they can run in parallel..
imagine one thread running on every of the quad-cores...
thread give u many other advantages as well , as you must already know

Is concurrent programming more grided or clustered?

I'm trying to wrap my brain around parallel/concurrent programming (in Java) and am getting hung up on some fundamentals that don't seem to be covered in any of the tutorials I've been reading.
When we talk about "multi-threading", or "parallel/concurrent programming", does that mean we're taking a big problem and spreading it over many threads, or are we first explicitly decomposing it into smaller sub-problems, and passing each sub-problem to its own thread?
For example, let's say we have EndWorldHungerTask implements Runnable, and task accomplishes some enormous problem. In order to complete its objective, it has to do some really heavy lifting, say, a hundred million times:
public class EndWorldHungerTask implements Runnable {
public void run() {
for(int i = 0; i < 100000000; i++)
someReallyExpensiveOperation();
}
}
In order to make this "concurrent" or "multi-threaded", would we pass this EndWorldHungerTask to, say, 100 worker threads (where each of the 100 workers are told by the JVM when to be active and work on the next iteration/someReallyExpensiveOperation() call), or would we refactor it manually/explicitly so that each of the 100 workers is iterating over different parts of the loop/work-to-be-done? In both cases, each of the 100 workers is only iterating a million times.
But, under the first paradigm, Java is telling each Thread when to execute. Under the second, the developer needs to manually (in the code) partition the problem ahead of time, and assign each sub-problem to a new Thread.
I guess I'm asking how its "normally done" in Java land. And, not just for this problem, but in general.

I guess I'm asking how its "normally done" in Java land. And, not just for this problem, but in general.
This is highly dependent on the task at hand.
The standard paradigm in Java is that you have to split the work into chunks yourself. Distributing those chunks across multiple threads/cores is a separate problem, and there exist a variety of patterns for that (queues, thread pools, etc).
It is interesting to note that there exist frameworks that can automatically make use of multiple cores to execute things like for loops in parallel (for example, OpenMP). However, I am not aware of any such frameworks for Java.
Finally, it could be the case that the low-level library that does the bulk of the work can make use of multiple cores. In such a case the higher-level code may be able to remain single-threaded and still benefit from multicore hardware. One example might be numerical code using MKL under the covers.

When we talk about "multi-threading", or "parallel/concurrent programming", does that mean we're taking a big problem and spreading it over many threads, or are we first explicitly decomposing it into smaller sub-problems, and passing each sub-problem to its own thread?
I think this depends highly on the problem. There are times where you have the same task that you call 1000s or millions of times using the same code. This is the ExecutorSerivce.submit() type of pattern. You has million of lines from a file and you are running some processing methods on each line. I guess this is your "spreading it over many threads" type of problem. This works for simple thread models.
But there are other cases where the problem space is made up of a large number of non-homogenous tasks. Sometimes you might spawn a single thread to handle some background keep-alive, and other times a thread pool here and there to process some queue of work. Typically the larger the scope of the problem, the more complicated the concurrency model and the more different types of pools and threads are used. I guess this is your "decomposing it into smaller sub-problems" type.
In order to make this "concurrent" or "multi-threaded", would we pass this EndWorldHungerTask to, say, 100 worker threads (where each of the 100 workers are told by the JVM when to be active and work on the next iteration/someReallyExpensiveOperation() call), or would we refactor it manually/explicitly so that each of the 100 workers is iterating over different parts of the loop/work-to-be-done? In both cases, each of the 100 workers is only iterating a million times.
In your case, I don't see how you can solve world hunger (to use your analogy) with one set of thread code. I think that you have to "decompose it into smaller sub-problems" which corresponds to the latter case that I explain above: a whole series of threads running different code. Some of the sub-solutions can be done in thread-pools and some will be done with individual threads, each running separate code.
I guess I'm asking how its "normally done" in Java land. And, not just for this problem, but in general.
"Normally" depends highly on the problem and its complexity. In my experience, I normally use the ExecutorService constructs as much as possible. But with any decent sized problem you will find yourself with a number of different thread-pools, Spring timer threads, custom one-off thread tasks, producer/consumer models, etc., etc..

Normally you would want each thread to execute one task form start to finish, you would gain nothing from leaving the task half done, then halting execution on that thread and "calling" another thread to finish the job. Java offers of course tools for this kind of thread synchronization, but they are really used when a task is depending on another task to complete - not so that another thread may complete the task.
Most of the time you will have a big problem, that consists of several tasks, if this tasks can be executed concurrently then it would make sense to spawn threads to execute this tasks. There is an overhead associated with creating threads, so if all the tasks are sequential and must wait for the other to finish, then it would not be beneficial at all to spawn multiple threads, just one thread so you don't block the main thread.

"multi-threading" <> "parallel/concurrent programming".
Multithreaded apps are often written to take advantage of the high I/O performance of a preemptive multitasker. An example might be a web crawler/downloader. A multithreaded crawler would typically outperform a single-threaded version by a huge factor, even when running on a box with only one CPU core. The actions of a DNS query to get a site address, connecting to the site, downloading a page, writing it to a disk file are all operations that require little CPU but a lot of IO waiting. So, a lot of these unavoidable waits can be performed in parallel by many threads. When a DNS query comes in, an HTTP client connects or a disk operation is complete, the thread that requested it is made ready/running and can move on to the next operation.
The vast majority of apps are, primarily, written as multithreaded for this reason. That's why the box I'm writing this on has 98 processes, (of which 94 have more than one thread), 1360 threads and 3% CPU use - it's got little to do with splitting CPU work up across cores - it's mostly about IO performance.
Parallel/concurrent programming can actually take place with multiple CPU cores. For those apps that have CPU-intensive work that can be decomposed into largish packages for distribution across cores, a speedup factor approaching the number of cores is possible with care.
Naturally there is some bleedover - the I/O bound web-crawler will tend to perform better on a box with more cores, if only because the interrupt/driver overhead has a smaller impact on overall performance, but it wont be better by much.
It doesn't matter how many workers you have available for the EndWorldHunger Task if they are all waiting for the crops to grow.

How can I make sure N threads run at roughly the same speed?

I'm toying with the idea of writing a physics simulation software in which each physical element would be simulated in its own thread.
There would be several advantages to this approach. It would be conceptually very close to how the real world works. It would be much easier to scale the system to multiple machines.
However, for this to work I need to make sure that all threads run at the same speed, with a rather liberal interpretation of 'same'. Say within 1% of each others.
That's why I don't necessarily need a Thread.join() like solution. I don't want some uber-controlling school mistress that ensures all threads regularly synchronize with each others. I just need to be able to ask the runtime (whichever it is---could be Java, Erlang, or whatever is most appropriate for this problem) to run the threads at a more or less equal speed.
Any suggestions would be extremely appreciated.
UPDATE 2009-03-16
I wanted to thank everyone who answered this question, in particular all those whose answer was essentially "DON'T DO THIS". I understand my problem much better now thanks to everybody's comments and I am less sure I should continue as I originally planned. Nevertheless I felt that Peter's answer was the best answer to the question itself, which is why I accepted it.

You can't really do this without coordination. What if one element ended up needing cheaper calculations than another (in a potentially non-obvious way)?
You don't necessarily need an uber-controller - you could just keep some sort of step counter per thread, and have a global counter indicating the "slowest" thread. (When each thread has done some work, it would have to check whether it had fallen behind the others, and update the counter if so.) If a thread notices it's a long way ahead of the slowest thread, it could just wait briefly (potentially on a monitor).
Just do this every so often to avoid having too much overhead due to shared data contention and I think it could work reasonably well.

You'll need some kind of synchronization. CyclicBarrier class has what you need:
A synchronization aid that allows a
set of threads to all wait for each
other to reach a common barrier point.
CyclicBarriers are useful in programs
involving a fixed sized party of
threads that must occasionally wait
for each other. The barrier is called
cyclic because it can be re-used after
the waiting threads are released.
After each 'tick', you can let all your threads to wait for others, which were slower. When remaining threads reach the barrier, they all will continue.

Threads are meant to run completely independent of each other, which means synchronizing them in any way is always a pain. In your case, you need a central "clock" because there is no way to tell the VM that each thread should get the same amount of ... uh ... what should it get? The same amount of RAM? Probably doesn't matter. The same amount of CPU? Are all your objects so similar that each needs the same number of assembler instructions?
So my suggestion is to use a central clock which broadcasts clock ticks to every process. All threads within each process read the ticks (which should be absolute), calculate the difference to the last tick they saw and then update their internal model accordingly.
When a thread is done updating, it must put itself to sleep; waiting for the next tick. In Java, use wait() on the "tick received" lock and wake all threads with "notifyAll()".

I'd recommend not using threads wherever possible because they just add problems later if you're not careful. When doing physics simulations you could use hundreds of thousands of discrete objects for larger simulations. You can't possibly create this many threads on any OS that I know of, and even if you could it would perform like shit!
In your case you could create a number of threads, and put an event loop in each thread. A 'master' thread could sequence the execution and post a 'process' event to each worker thread to wake it up and make it do some work. In that way the threads will sleep until you tell them to work.
You should be able to get the master thread to tick at a rate that allows all your worker threads to complete before the next tick.
I don't think threads are the answer to your problem, with the exception of parallelising into a small number of worker threads (equal to the number of cores in the machine) which each linearly sequence a series of physical objects. You could still use the master/event-driven approach this way, but you would remove a lot of the overhead.

Please don't. Threads are an O/S abstraction permitting the appearance of parallel execution. With multiple and multicore CPU's, the O/S can (but need not) distribute threads among the different cores.
The closest thing to your scalability vision which I see as workable is to use worker threads, dimensioned to roughly match the number of cores you have, and distribute work among them. A rough draft: define a class ActionTick which does the updating for one particle, and let the worker thread pick ActionTicks to process from a shared queue. I see several challenges even with such a solution.
Threading overheads: you get context switching overhead among different worker threads. Threads by themselves are expensive (if not actually as ruinous as processes): test performance with different thread pool sizes. Adding more threads beyond the number of cores tends to reduce performance!
Synchronization costs: you get several spots of contention: access to the work queue for one, but worse, access to the simulated world. You need to delimit the effects of each ActionTick or implement a lot of locking/unlocking.
Difficulty of optimizating the physics. You want to delimit the number of objects/particles each ActionTick looks at (distance cut-off? 3D-tree-subdivision of the simulation space?). Depending on the simulation domain, you may be able to eliminate a lot of work by examining whether any changes is even needed in a subset of items. Doing these kinds of optimizations is easier before queueing work items, rather than as a distributed algorithm. But then that part of your simulation becomes a potential scalability bottleneck.
Complexity. Threading and concurrency introduces several cans of worms to a solution. Always consider other options first -- but if you need them, try threads before creating your own work item scheduling, locking and execution strategies...
Caveat: I haven't worked with any massive simulation software, just some hobbyist code.

As you mention, there are many "DON'T DO THIS" answers. Most seem to read threads as OS threads used by Java. Since you mentioned Erlang in your post, I'd like to post a more Erlang-centered answer.
Modeling this kind of simulation with processes (or actors, micro threads, green threads, as they are sometimes called) doesn't necessarily need any synchronization. In essence, we have a couple of (most likely thousands or hundreds of thousands) physics objects that need to be simulated. We want to simulate these objects as realistically as possible, but there is probably also some kind of real time aspect involved (doesn't have to be though, you don't mention this in your question).
A simple solution would be to spawn of an Erlang process for each object, sent ticks to all of them and collect the results of the simulation before proceeding with the next tick. This is in practice synchronizing everything. It is of course more of a deterministic solution and does not guarantee any real time properties. It is also non-trivial how the processes would talk to each other to get the data they need for the calculations. You probably need to group them in clever ways (collision groups etc), have hibernated processes (which Erlang has neat support for) for sleeping objects, etc to speed things up.
To get real time properties you probably need to restrain the calculations performed by the processes (trading accuracy for speed). This could perhaps be done by sending out ticks without waiting for answers, and letting the object processes reply back to each tick with their current position and other data you need (even though it might only be approximated at the time). As DJClayworth says, this could lead to errors accumulating in the simulation.
I guess in one sense, the question is really about if it is possible to use the strength of concurrency to gain some kind of advantage here. If you need synchronization, it is a quite strong sign that you do not need concurrency between each physics object. Because you essentially throw away a lot of computation time by waiting for other processes. You might use concurrency during calculation but that is another discussion, I think.
Note: none of these ideas take the actual physics calculations into account. This is not Erlang strong side and could perhaps be performed in a C library or whatever strikes your fancy, depending on the type of characteristics you want.
Note: I do not know of any case where this has been done (especially not by me), so I cannot guarantee that this is sound advice.

Even with perfect software, hardware will prevent you doing this. Hardware threads typically don't have fair performance. Over a short period, you are lucky if threads run within +-10% performance.
The are, of course, outliers. Some chipsets will run some cores in powersaving mode and others not. I believe one of the Blue Gene research machines had software controlled scheduling of hardware threads instead of locks.

Erlang will by default try and spread its processes evenly over the available threads. It will also by default try to run threads on all available processors. So if you have enough runnable Erlang processes then you will get a relatively even balance.

I'm not a threading expert, but isn't the whole point of threads that they are independent from each other - and non-deterministic?

I think you have a fundamental misconception in your question where you say:
It would be conceptually very close to how the real world works
The real world does not work in a thread-like way at all. Threads in most machines are not independent and not actually even simultaneous (the OS will use context-switching instead). They provide the most value when there is a lot of IO or waiting occurring.
Most importantly, the real-world does not "consume more resources" as more complex things happen. Think of the difference between two objects falling from a height, one falling smoothly and the other performing some kind of complex tumbling motion...

I would make a kind of "clock generator" - and would register every new object/thread there. The clock will notify all registered objects when the delta-t has passed.
However this does not mean you need a separate thread for every object. Ideally you will have as many threads as processors.
From a design point of you could separate the execution of the object-tasks through an Executor or a thread-pool, e.g. when an object receives the tick event, it goes to a thread pool and schedules itself for execution.

Two things has to happen in order to achieve this. You have to assure thah you have equal number of threads per CPU core, and you need some kind of synchronization.
That sync can be rather simple, like checking "cycle-done" variable for each thread while performing computation, but you can't avoid it.

Working at control for motors i have used some math to maintain velocity at stable state.
The system have PID control, proportional, integral and derivative. But this is analog/digital system. Maybe can use similarly to determine how mush time each thread must run, but the biggest tip I can give you is that all threads will each have a clock synchronization.

I'm first to admit I'm not a threading expert, but this sounds like a very wrong way to approach simulation. As others have already commented having too many threads is computationally expensive. Furthermore, if you are planing to do what I think you are thinking of doing, your simulation may turn out to produce random results (may not matter if you are making a game).
I'd go with a few worker threads used to calculate discrete steps of the simulation.

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.