Which of these two multithreading layouts is better? - java

I have a multithreaded layout where there is a manager object and many worker objects.
I am unsure which layout is better to use:
1 - The workers run in a loop and constantly ask the manager for a "new job" after finishing one.
or
2 - The manager gives new jobs to the workers after they finish each job.
Are there any recommendations for this?

This is a question I have wrestled with many times. Each time I have chosen based on the specific situation I was coding for. You should do the same.
However, to choose correctly you must study the two approaches carefully.
Consider a test case.
You have thousands of files to process.
1. The workers are in control.
The manager becomes a queue of all of the files to process. You create a fixed number of worker threads which request the next file from the manager and repeat until the list is exhausted.
Consequences
You usually end up having to synchronize access to the queue.
You can tinker with the number of workers to attain maximal throughput for your hardware architecture.
Sometimes you can dynamically adjust the number of workers depending on the current load, but this can be tricky. If successful, you can often get a solution that is very well tuned to your hardware. A minimal sketch of this layout follows below.
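Assuming the jobs are simply file paths held in a thread-safe queue (the file names and worker count here are made up for illustration), it could look roughly like this:

import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

public class WorkerDrivenDemo {

    // The "manager" is nothing more than a thread-safe queue of pending files.
    private static final Queue<Path> jobs = new ConcurrentLinkedQueue<>(
            List.of(Paths.get("a.txt"), Paths.get("b.txt"), Paths.get("c.txt")));

    public static void main(String[] args) throws InterruptedException {
        int workerCount = 4; // tune this for your hardware
        Thread[] workers = new Thread[workerCount];
        for (int i = 0; i < workerCount; i++) {
            workers[i] = new Thread(() -> {
                Path file;
                // Each worker keeps asking the queue for the next file until it is empty.
                while ((file = jobs.poll()) != null) {
                    process(file);
                }
            });
            workers[i].start();
        }
        for (Thread w : workers) {
            w.join(); // wait until every worker has drained the queue
        }
    }

    private static void process(Path file) {
        System.out.println(Thread.currentThread().getName() + " processed " + file);
    }
}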
2. The manager is in control.
The manager creates a new Callable for every file and adds it to an Executor controlled thread pool.
Consequences
Well ... just about the same if you think about it. The only difference really is that the executor does the queueing.
There is less synchronization required (except of course internally in the Executor).
Dynamically adjusting the number of threads is not trivial but I expect one could subclass the Executor to achieve this.
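A comparable sketch of the manager-in-control layout, where the manager submits one Callable per file and lets the ExecutorService do the queueing (again, the file names are invented):

import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ManagerDrivenDemo {

    public static void main(String[] args) throws Exception {
        List<Path> files = List.of(Paths.get("a.txt"), Paths.get("b.txt"), Paths.get("c.txt"));

        ExecutorService pool = Executors.newFixedThreadPool(4);
        List<Future<String>> results = new ArrayList<>();

        // The manager creates one Callable per file; the executor queues and runs them.
        for (Path file : files) {
            Callable<String> task = () -> Thread.currentThread().getName() + " processed " + file;
            results.add(pool.submit(task));
        }

        for (Future<String> result : results) {
            System.out.println(result.get()); // blocks until that task is done
        }
        pool.shutdown();
    }
}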
In summary
The two architectures are very nearly the same. A number of threads process a sequence of items in parallel.
The differences are more in the dynamics and the footprint.
When the workers are in control, a known number of objects are present at any time. An extensive queue can build up, but these would presumably be small objects. Work is done at a fixed and predictable pace. If the work starts to pile up, you have to make a special effort to deal with it.
When the manager is in control there can be an explosion of workers, most of which are just sitting around waiting for the Executor. Essentially, the Executor becomes the manager and the Thread pool holds the workers.
I personally prefer the workers being in control. Mostly I suppose because given two essentially similar architectures I normally prefer the one with the most predictable footprint. I plan to experiment.

They really aren't THAT different in principle. I think it really comes down to how you go about implementing the logic to do either of these things. I could see that making more of a difference than which you wind up going with.
The key to option 2, though, is that the manager needs to know when a worker has finished a job. So at that point the workers still need to tell the manager, which is pretty much the same thing as asking for a new job.
I think it really comes down to how you plan to do IPC. In theory I think the 2nd one is the better option, but it depends on how elegantly you make it work.

I would use a BlockingQueue. The manager adds jobs to the queue in a loop, and the workers take jobs from the queue and do them in a loop.
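A minimal sketch of that idea, using a poison-pill value to tell the workers there are no more jobs (the job strings and counts are arbitrary):

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class BlockingQueueDemo {

    private static final String POISON_PILL = "STOP"; // sentinel telling a worker to exit

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(100);
        int workerCount = 3;

        // Workers: take jobs from the queue until the poison pill arrives.
        for (int i = 0; i < workerCount; i++) {
            new Thread(() -> {
                try {
                    while (true) {
                        String job = queue.take(); // blocks while the queue is empty
                        if (job.equals(POISON_PILL)) {
                            return;
                        }
                        System.out.println(Thread.currentThread().getName() + " did " + job);
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }).start();
        }

        // Manager: add jobs in a loop, then one pill per worker so they all shut down.
        for (int i = 0; i < 10; i++) {
            queue.put("job-" + i);
        }
        for (int i = 0; i < workerCount; i++) {
            queue.put(POISON_PILL);
        }
    }
}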

Scheduled Job Task

Subject:
I'm trying to implement basic job scheduling in Java to handle recurring persisted scheduled tasks (for a personal learning project). I don't want to use any ready-to-use libraries like Quartz/Obsidian/Cron4J/etc.
Objective:
Jobs have to be persistent (to survive a server shutdown)
Job execution can take up to ~2-5 min.
Manage a large number of jobs
Multithreaded
Light and fast ;)
All my jobs are in a MySQL database.
JOB_TABLE (id, name, nextExecution, lastExecution, status(IDLE, PENDING, RUNNING))
Step by step:
Retrieve each job from "JOB_TABLE" where "nextExecution > now" AND "status = IDLE". This step is executed every 10 min by a single thread.
For each job retrieved, I submit a new task to a ThreadPoolExecutor, then I update the job status to "PENDING" in my "JOB_TABLE".
When the job task starts running, I update the job status to "RUNNING".
When the job is finished, I update lastExecution with the current time, set a new nextExecution time and change the job status back to "IDLE".
When the server starts, I resubmit each PENDING/RUNNING job to the ThreadPoolExecutor.
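To make this concrete, here is a rough, untested sketch of how the steps might fit together; the JobRepository interface and its methods are just placeholders for my database access, not real code:

import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class SchedulerDraft {

    // Placeholder for whatever maps to JOB_TABLE; not an existing API.
    interface JobRepository {
        List<Job> findDueIdleJobs();                  // step 1
        void updateStatus(long jobId, String status); // PENDING / RUNNING
        void markFinished(long jobId);                // step 4: lastExecution, nextExecution, IDLE
    }

    record Job(long id, String name) {}

    private final JobRepository repository;
    private final ThreadPoolExecutor workers = new ThreadPoolExecutor(
            8, 8, 0L, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<>());
    private final ScheduledExecutorService poller = Executors.newSingleThreadScheduledExecutor();

    SchedulerDraft(JobRepository repository) {
        this.repository = repository;
    }

    void start() {
        // Step 1: a single thread polls the table every 10 minutes.
        poller.scheduleAtFixedRate(this::pollOnce, 0, 10, TimeUnit.MINUTES);
    }

    private void pollOnce() {
        for (Job job : repository.findDueIdleJobs()) {
            repository.updateStatus(job.id(), "PENDING");     // step 2
            workers.submit(() -> {
                repository.updateStatus(job.id(), "RUNNING"); // step 3
                runJob(job);
                repository.markFinished(job.id());            // step 4
            });
        }
    }

    private void runJob(Job job) {
        // the actual 2-5 minute unit of work goes here
    }
}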
Question/Observation:
Step 2: Will the ThreadPoolExecutor handle a large number of threads (~20,000)?
Should I use a NoSQL solution instead of MySQL?
Is it the best solution to deal with such a use case?
This is a draft; the sketch above is only an outline, not working code. I'm open to suggestions, comments and criticism!
I have done something similar to your task on a real project, but in .NET. Here is what I can recall regarding your questions:
Step 2: Will the ThreadPoolExecutor handle a large number of threads (~20,000)?
We discovered that .NET's built-in thread pool was the worst approach, because the project was a web application. Reason: the web application relies on the built-in thread pool (which is static and thus shared by all uses within the running process) to run each request in a separate thread while maintaining effective recycling of threads. Employing the same thread pool for our internal processing was going to exhaust it and leave no free threads for the user requests, or spoil their performance, which was unacceptable.
Since you seem to be running quite a lot of jobs (20k is a lot for a single machine), you should definitely look for a custom thread pool. No need to write your own though; I bet there are ready-made solutions, and writing one is far beyond what your study project would require (see the comments; if I understand correctly you are doing a school or university project).
Should I use a NoSQL solution instead of MySQL?
Depends. You obviously need to update the job status concurrently, so you will have simultaneous access to one single table from multiple threads. Databases can scale pretty well to that, assuming you do it right. Here is what I mean by doing it right:
Design your code so that each job affects only its own subset of rows in the database (this includes other tables). If you can do so, you will not need any explicit locks at the database level (in the form of transaction serialization levels). You can even use a liberal serialization level that allows dirty or phantom reads, which will perform faster. But beware: you must carefully ensure that no jobs compete over the same rows. This is hard to achieve in real-life projects, so you should probably look at alternative approaches to db locking.
Use an appropriate transaction serialization mode. The transaction serialization mode defines the locking behavior at the database level. You can set it to lock the entire table, only the rows you affect, or nothing at all. Use it wisely, as any misuse could affect data consistency, integrity and the stability of the entire application or db server.
I am not familiar with NoSQL databases, so I can only advise you to research their concurrency capabilities and map them to your scenario. You could end up with a really suitable solution, but you have to check against your needs. From your description, you will have to support simultaneous data operations over the same type of objects (whatever the analog of a table is).
Is it the best solution to deal with such a use case?
Yes and No.
Yes, because you will encounter one of the difficult tasks developers face in the real world. I have worked with colleagues with more than three times my experience who were more reluctant than me to take on multithreading tasks; they really hated it. If this area interests you, play with it, learn and improve as much as you need to.
No, because if you are working on a real-life project, you need something reliable. If you have this many questions, you will obviously need time to mature before you can produce a stable solution for such a task. Multithreading is a difficult topic for many reasons:
It is hard to debug
It introduces many points of failure; you need to be aware of all of them
It can be a pain for other developers to assist with or work on your code, unless you stick to commonly accepted rules
Error handling can be tricky
Behavior is unpredictable / nondeterministic
There are existing solutions with high level of maturity and reliability that are the preferred approach for real projects. Drawback is that you will have to learn them and examine how customizable they are for your needs.
Anyway, if you need to do it your way, and then port your work to a real project or a project of your own, I can advise you to do it in a pluggable way. Use abstraction, programming to interfaces and other practices to decouple your own specific implementation from the logic that sets up the scheduled jobs. That way, you can adapt your API to an existing solution if this becomes a problem.
And last, but not least, I did not see any error-handling provisions on your side. Think about and research what to do if a job fails. At least add a 'FAILED' status or something to persist in such a case. Error handling is tricky when it comes to threads, so be thorough in your research and practices.
Good luck
You can declare the maximum pool size with ThreadPoolExecutor#setMaximumPoolSize(int). Since Integer.MAX_VALUE is larger than 20,000, then technically yes, it can.
The other question is whether your machine will support running that many threads. You will have to provide enough RAM, since each thread allocates its own stack.
There should be no problem addressing ~20,000 threads on a modern desktop or laptop, but on a mobile device it could be an issue.
From doc:
Core and maximum pool sizes
A ThreadPoolExecutor will automatically adjust the pool size (see getPoolSize()) according to the bounds set by corePoolSize (see getCorePoolSize()) and maximumPoolSize (see getMaximumPoolSize()). When a new task is submitted in method execute(java.lang.Runnable), and fewer than corePoolSize threads are running, a new thread is created to handle the request, even if other worker threads are idle. If there are more than corePoolSize but less than maximumPoolSize threads running, a new thread will be created only if the queue is full. By setting corePoolSize and maximumPoolSize the same, you create a fixed-size thread pool. By setting maximumPoolSize to an essentially unbounded value such as Integer.MAX_VALUE, you allow the pool to accommodate an arbitrary number of concurrent tasks. Most typically, core and maximum pool sizes are set only upon construction, but they may also be changed dynamically using setCorePoolSize(int) and setMaximumPoolSize(int).
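For illustration, here is what those bounds look like when constructing a ThreadPoolExecutor directly; the sizes below are arbitrary examples, not recommendations:

import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class PoolSizingDemo {
    public static void main(String[] args) {
        // core == max: a fixed-size pool of 16 threads; extra tasks wait in the queue.
        ThreadPoolExecutor fixed = new ThreadPoolExecutor(
                16, 16, 0L, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<>());

        // Small core, higher maximum, bounded queue: threads beyond the core are
        // only created once the queue of 1000 waiting tasks is full.
        ThreadPoolExecutor elastic = new ThreadPoolExecutor(
                4, 64, 60L, TimeUnit.SECONDS, new LinkedBlockingQueue<>(1000));

        elastic.setMaximumPoolSize(128); // the bounds can also be changed dynamically
        System.out.println(fixed.getCorePoolSize() + " / " + fixed.getMaximumPoolSize());

        fixed.shutdown();
        elastic.shutdown();
    }
}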
About the DB: create a solution that does not depend on the DB structure. Then you can set up two environments and measure them. Start with the technology that you know, but keep an open mind about other solutions. At the beginning, the relational DB should keep up performance-wise, and if you manage it properly, it should not become an issue later. NoSQL databases are used to work with really big data. But the best thing for you is to build both and run some performance tests.

Thread Pool vs Many Individual Threads

I'm in the middle of a problem where I am unable to decide which solution to take.
The problem is a bit unusual. Let's put it this way: I am receiving data from the network continuously (2 to 4 times per second). Each piece of data belongs to a different, let's say, group.
Let's call these groups group1, group2 and so on.
Each group has a dedicated job queue where data from the network is filtered and added to its corresponding group for processing.
At first I created a dedicated thread per group which would take data from the job queue, process it and then go into a blocking state (using a LinkedBlockingQueue).
But my senior suggested that I should use thread pools, because that way threads won't get blocked and will be usable by other groups for processing.
But here is the thing: the data is arriving fast enough, and the time a thread takes to process it is long enough, that the thread may never actually block. A dedicated thread also guarantees that data gets processed sequentially (job 1 gets done before job 2), which with pooling might, in rare cases, not happen.
My senior is also bent on the fact that pooling will save us lots of memory because threads are POOLED (I think he really went for the word ;) ). I don't agree, because, as I understand it, pooled or not, each thread gets its own stack memory, unless there is something about thread pools that I am not aware of.
One last thing: I always thought that pooling helps when a large number of jobs appear for a short time. This makes sense because spawning a thread per job would kill performance, since the time taken to initialize a thread is far more than the time spent doing the job. So pooling helps a lot there.
But in my case group1, group2, ..., groupN always remain alive. Whether there is data or not, they will still be there. So thread spawning is not the issue here.
My senior is not convinced and wants me to go with the pooling solution because of its supposedly smaller memory footprint.
So, which path to take?
Thank you.
Good question.
Pooling indeed saves you initialization time, as you said. But it has another aspect: resource management. And here I am asking you this: just how many groups (read: dedicated threads) do you have?
Do they grow dynamically during the execution span of the application?
For example, consider a situation where the answer to this question is yes: new group types are added dynamically. In this case you might not want to dedicate a thread to each one, since there is technically no restriction on the number of groups that will be created; you would create a lot of threads and the system would spend its time context switching instead of doing real work.
Thread pooling to the rescue: a thread pool allows you to put a cap on the maximal number of threads that could possibly be created, regardless of load. So the application may deny service to certain requests, but the ones that get through are handled properly, without critically depleting system resources.
Considering the above, it is very possible that in your case it is very much OK to have a dedicated thread for each group!
The same goes for your senior's conviction that it will save memory. Indeed, a thread takes up memory, but is it really so much if it is a predefined number, say 5, or even 10? That is probably OK. Anyway, you should not use pooling unless you are a priori and absolutely convinced that you actually have a problem!
Pooling is a design decision, not an architectural one. You can not pool at the beginning and proceed with optimizations if you find pooling to be beneficial after you encounter a performance issue.
As for the serialization of requests (in-order execution), it does not matter whether you are using a thread pool or a dedicated thread: sequential execution is a property of a queue coupled with a single handler thread.
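A minimal sketch of that idea: give each group its own single-threaded executor, so jobs within a group run strictly in order while the groups still run in parallel (the group names and jobs are invented):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class GroupedSequentialDemo {

    // One single-threaded executor per group preserves per-group ordering.
    private final Map<String, ExecutorService> executors = new ConcurrentHashMap<>();

    void submit(String group, Runnable job) {
        executors
            .computeIfAbsent(group, g -> Executors.newSingleThreadExecutor())
            .submit(job);
    }

    public static void main(String[] args) {
        GroupedSequentialDemo demo = new GroupedSequentialDemo();
        for (int i = 0; i < 5; i++) {
            final int n = i;
            demo.submit("group1", () -> System.out.println("group1 job " + n));
            demo.submit("group2", () -> System.out.println("group2 job " + n));
        }
        demo.executors.values().forEach(ExecutorService::shutdown);
    }
}

Within each group the output order is guaranteed; across groups it is not, which matches the requirement described in the question.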
Creating a thread consumes resources, including the default stack per thread (IIRC 512 KB, but configurable). So the advantage of pooling is that you incur a limited resource hit. Of course you need to size your pool according to the work that you have to perform.
For your particular problem, I think the key is to actually measure performance/thread usage etc. in each scenario. Unless you're running into constraints, I perhaps wouldn't worry either way, other than to make sure that you can swap one implementation for another without a major impact on your application. Remember that premature optimisation is the root of all evil. Note that:
"Premature optimization" is a phrase used to describe a situation
where a programmer lets performance considerations affect the design
of a piece of code. This can result in a design that is not as clean
as it could have been or code that is incorrect, because the code is
complicated by the optimization and the programmer is distracted by
optimizing.

Is concurrent programming more grided or clustered?

I'm trying to wrap my brain around parallel/concurrent programming (in Java) and am getting hung up on some fundamentals that don't seem to be covered in any of the tutorials I've been reading.
When we talk about "multi-threading", or "parallel/concurrent programming", does that mean we're taking a big problem and spreading it over many threads, or are we first explicitly decomposing it into smaller sub-problems, and passing each sub-problem to its own thread?
For example, let's say we have EndWorldHungerTask implements Runnable, and the task solves some enormous problem. In order to complete its objective, it has to do some really heavy lifting, say, a hundred million times:
public class EndWorldHungerTask implements Runnable {
    public void run() {
        for (int i = 0; i < 100000000; i++)
            someReallyExpensiveOperation();
    }
}
In order to make this "concurrent" or "multi-threaded", would we pass this EndWorldHungerTask to, say, 100 worker threads (where each of the 100 workers are told by the JVM when to be active and work on the next iteration/someReallyExpensiveOperation() call), or would we refactor it manually/explicitly so that each of the 100 workers is iterating over different parts of the loop/work-to-be-done? In both cases, each of the 100 workers is only iterating a million times.
But, under the first paradigm, Java is telling each Thread when to execute. Under the second, the developer needs to manually (in the code) partition the problem ahead of time, and assign each sub-problem to a new Thread.
I guess I'm asking how it's "normally done" in Java land. And, not just for this problem, but in general.
I guess I'm asking how it's "normally done" in Java land. And, not just for this problem, but in general.
This is highly dependent on the task at hand.
The standard paradigm in Java is that you have to split the work into chunks yourself. Distributing those chunks across multiple threads/cores is a separate problem, and there exist a variety of patterns for that (queues, thread pools, etc).
It is interesting to note that there exist frameworks that can automatically make use of multiple cores to execute things like for loops in parallel (for example, OpenMP). However, I am not aware of any such frameworks for Java.
Finally, it could be the case that the low-level library that does the bulk of the work can make use of multiple cores. In such a case the higher-level code may be able to remain single-threaded and still benefit from multicore hardware. One example might be numerical code using MKL under the covers.
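To make the "split the work into chunks yourself" point concrete, here is a minimal sketch that partitions the loop from the question into ranges and hands each range to a thread pool (someReallyExpensiveOperation is just a stand-in):

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ChunkedLoopDemo {

    public static void main(String[] args) throws Exception {
        final int total = 100_000_000;
        final int workers = 100;
        final int chunk = total / workers; // assumes an even split, as in the question

        ExecutorService pool = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());
        List<Future<?>> futures = new ArrayList<>();

        // The developer partitions the loop into ranges; the pool only distributes them.
        for (int w = 0; w < workers; w++) {
            final int start = w * chunk;
            final int end = start + chunk;
            futures.add(pool.submit(() -> {
                for (int i = start; i < end; i++) {
                    someReallyExpensiveOperation();
                }
            }));
        }
        for (Future<?> f : futures) {
            f.get(); // wait for every chunk to finish
        }
        pool.shutdown();
    }

    private static void someReallyExpensiveOperation() {
        // stand-in for the real work
    }
}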
When we talk about "multi-threading", or "parallel/concurrent programming", does that mean we're taking a big problem and spreading it over many threads, or are we first explicitly decomposing it into smaller sub-problems, and passing each sub-problem to its own thread?
I think this depends highly on the problem. There are times when you have the same task that you call thousands or millions of times using the same code. This is the ExecutorService.submit() type of pattern. You have a million lines from a file and you are running some processing method on each line. I guess this is your "spreading it over many threads" type of problem. This works for simple thread models.
But there are other cases where the problem space is made up of a large number of non-homogenous tasks. Sometimes you might spawn a single thread to handle some background keep-alive, and other times a thread pool here and there to process some queue of work. Typically the larger the scope of the problem, the more complicated the concurrency model and the more different types of pools and threads are used. I guess this is your "decomposing it into smaller sub-problems" type.
In order to make this "concurrent" or "multi-threaded", would we pass this EndWorldHungerTask to, say, 100 worker threads (where each of the 100 workers are told by the JVM when to be active and work on the next iteration/someReallyExpensiveOperation() call), or would we refactor it manually/explicitly so that each of the 100 workers is iterating over different parts of the loop/work-to-be-done? In both cases, each of the 100 workers is only iterating a million times.
In your case, I don't see how you can solve world hunger (to use your analogy) with one set of thread code. I think that you have to "decompose it into smaller sub-problems" which corresponds to the latter case that I explain above: a whole series of threads running different code. Some of the sub-solutions can be done in thread-pools and some will be done with individual threads, each running separate code.
I guess I'm asking how it's "normally done" in Java land. And, not just for this problem, but in general.
"Normally" depends highly on the problem and its complexity. In my experience, I normally use the ExecutorService constructs as much as possible. But with any decent sized problem you will find yourself with a number of different thread-pools, Spring timer threads, custom one-off thread tasks, producer/consumer models, etc., etc..
Normally you would want each thread to execute one task from start to finish; you would gain nothing from leaving the task half done, halting execution on that thread and "calling" another thread to finish the job. Java of course offers tools for this kind of thread synchronization, but they are really used when a task depends on another task completing, not so that another thread may complete the task.
Most of the time you will have a big problem that consists of several tasks; if these tasks can be executed concurrently, then it makes sense to spawn threads to execute them. There is an overhead associated with creating threads, so if all the tasks are sequential and must wait for each other to finish, then it would not be beneficial at all to spawn multiple threads, just one thread so you don't block the main thread.
"multi-threading" <> "parallel/concurrent programming".
Multithreaded apps are often written to take advantage of the high I/O performance of a preemptive multitasker. An example might be a web crawler/downloader. A multithreaded crawler would typically outperform a single-threaded version by a huge factor, even when running on a box with only one CPU core. The actions of a DNS query to get a site address, connecting to the site, downloading a page, writing it to a disk file are all operations that require little CPU but a lot of IO waiting. So, a lot of these unavoidable waits can be performed in parallel by many threads. When a DNS query comes in, an HTTP client connects or a disk operation is complete, the thread that requested it is made ready/running and can move on to the next operation.
The vast majority of apps are, primarily, written as multithreaded for this reason. That's why the box I'm writing this on has 98 processes, (of which 94 have more than one thread), 1360 threads and 3% CPU use - it's got little to do with splitting CPU work up across cores - it's mostly about IO performance.
Parallel/concurrent programming can actually take place with multiple CPU cores. For those apps that have CPU-intensive work that can be decomposed into largish packages for distribution across cores, a speedup factor approaching the number of cores is possible with care.
Naturally there is some bleedover: the I/O-bound web crawler will tend to perform better on a box with more cores, if only because the interrupt/driver overhead has a smaller impact on overall performance, but it won't be better by much.
It doesn't matter how many workers you have available for the EndWorldHungerTask if they are all waiting for the crops to grow.

Java threading objects

I've created an array of objects with a size of 1000, and each of them is threaded, so that means 1000 threads are running. Each object holds a socket and 9 more global variables. The whole object consists of 1000 lines of code.
I'm looking for ways to make the program efficient, because it lags. CPU use is at 100% every time I start the program.
I understand that I'm going to have to change the way the program works, but I can't find a good way. Can anyone explain how to achieve this?
It depends on what your threads actually do - are the tasks primarily using CPU or other resources? For CPU intensive tasks, the best strategy is to run as many threads as you have cores, or a few more. For threads which are blocking a lot on e.g. reading files, waiting for the net etc. you can have many more threads than CPUs.
It also depends on how many cores the system has. Obviously the answer is very different for a single processor machine than for a 128-way multiprocessor. The above rules of thumb can give you some estimates, but it is best to make experiments yourself based on these, to figure out the ideal number of threads for your specific setup.
Moreover, since Java 5, it is always advisable to use e.g. a ThreadPoolExecutor instead of creating your threads manually. This makes your app both more robust and more flexible.
1/ use thread pool
2/ use futures
You should consider refactoring your usage of threads.
1000 threads normally make no sense on a normal machine/server, although your problem seems to be I/O-heavy. You should consider the number of CPU threads that are available.
A possible solution would be to use a dispatcher that passes the handling of (and possibly the response to) a request on the socket into the queue of a ThreadPoolExecutor.
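A rough sketch of that dispatcher idea, assuming the sockets come from a plain ServerSocket accept loop (the port and pool size are arbitrary):

import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class SocketDispatcherDemo {

    public static void main(String[] args) throws IOException {
        // A small fixed pool instead of one thread per connection.
        ExecutorService pool = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors() * 2);

        try (ServerSocket server = new ServerSocket(9000)) {
            while (true) {
                Socket socket = server.accept();       // the dispatcher thread only accepts
                pool.submit(() -> handle(socket));     // handling happens on the pool
            }
        }
    }

    private static void handle(Socket socket) {
        try (socket) {
            // read the request and write the response here
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}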
From my experience, 1000 threads are just too many (at least on 8-core/8 GB RAM machines). A common symptom is context-switching thrashing, where your OS is just busy jumping from thread to thread while doing little useful work (and a lot of memory is wasted, etc.).
If you have to maintain 1000 sockets, you probably have to go for NIO. The easier way out would be closing/opening sockets every time (whether you can do this depends on the characteristics of your work).
The way you solve this many-threads problem is to use a thread pool, as others note. Instead of extending Thread, code a Runnable instead. This is easier said than done, though, because you have to maintain state if you need a conversation. This commonly involves a ConcurrentMap. I personally tend to put a Handler (which implements Runnable) on this map that should run when the counterparty returns a response (the response contains a key every time). In this case you'd be closing the socket every time. If you use NIO, it's more like coding with Threads in the sense that you don't need to identify the counterparty like this, but it has its own complexity.

How can I make sure N threads run at roughly the same speed?

I'm toying with the idea of writing a physics simulation software in which each physical element would be simulated in its own thread.
There would be several advantages to this approach. It would be conceptually very close to how the real world works. It would be much easier to scale the system to multiple machines.
However, for this to work I need to make sure that all threads run at the same speed, with a rather liberal interpretation of 'same'. Say within 1% of each other.
That's why I don't necessarily need a Thread.join() like solution. I don't want some uber-controlling school mistress that ensures all threads regularly synchronize with each other. I just need to be able to ask the runtime (whichever it is; could be Java, Erlang, or whatever is most appropriate for this problem) to run the threads at a more or less equal speed.
Any suggestions would be extremely appreciated.
UPDATE 2009-03-16
I wanted to thank everyone who answered this question, in particular all those whose answer was essentially "DON'T DO THIS". I understand my problem much better now thanks to everybody's comments and I am less sure I should continue as I originally planned. Nevertheless I felt that Peter's answer was the best answer to the question itself, which is why I accepted it.
You can't really do this without coordination. What if one element ended up needing cheaper calculations than another (in a potentially non-obvious way)?
You don't necessarily need an uber-controller - you could just keep some sort of step counter per thread, and have a global counter indicating the "slowest" thread. (When each thread has done some work, it would have to check whether it had fallen behind the others, and update the counter if so.) If a thread notices it's a long way ahead of the slowest thread, it could just wait briefly (potentially on a monitor).
Just do this every so often to avoid having too much overhead due to shared data contention and I think it could work reasonably well.
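A minimal sketch of that scheme, with a shared array of per-thread step counters, recomputing the slowest position on demand rather than keeping a separate global counter, and a brief sleep whenever a thread gets too far ahead (the thresholds are arbitrary):

import java.util.concurrent.atomic.AtomicLongArray;

public class LooseLockstepDemo {

    private static final int THREADS = 4;
    private static final long TOTAL_STEPS = 1_000_000;
    private static final long MAX_LEAD = 1_000; // how far ahead of the slowest a thread may run
    private static final AtomicLongArray steps = new AtomicLongArray(THREADS);

    public static void main(String[] args) {
        for (int t = 0; t < THREADS; t++) {
            final int id = t;
            new Thread(() -> {
                try {
                    for (long step = 0; step < TOTAL_STEPS; step++) {
                        doWork(id, step);
                        steps.set(id, step);
                        // Only check occasionally to keep contention on the shared array low.
                        if (step % 100 == 0) {
                            while (step - slowest() > MAX_LEAD) {
                                Thread.sleep(1); // back off briefly instead of hard synchronization
                            }
                        }
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }).start();
        }
    }

    private static long slowest() {
        long min = Long.MAX_VALUE;
        for (int i = 0; i < THREADS; i++) {
            min = Math.min(min, steps.get(i));
        }
        return min;
    }

    private static void doWork(int id, long step) {
        // the simulation step for element 'id' goes here
    }
}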
You'll need some kind of synchronization. The CyclicBarrier class has what you need:
A synchronization aid that allows a set of threads to all wait for each other to reach a common barrier point. CyclicBarriers are useful in programs involving a fixed sized party of threads that must occasionally wait for each other. The barrier is called cyclic because it can be re-used after the waiting threads are released.
After each 'tick', you can let all your threads wait for the others that were slower. When the remaining threads reach the barrier, they will all continue.
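For example, a minimal sketch where every simulated element awaits the barrier after each tick (the element count and the per-tick work are placeholders):

import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.CyclicBarrier;

public class BarrierTickDemo {

    public static void main(String[] args) {
        final int elements = 4;
        // The optional barrier action runs once per tick, after all parties arrive.
        CyclicBarrier barrier = new CyclicBarrier(elements, () -> System.out.println("tick complete"));

        for (int e = 0; e < elements; e++) {
            final int id = e;
            new Thread(() -> {
                try {
                    for (int tick = 0; tick < 10; tick++) {
                        simulate(id, tick);  // each element does its work for this tick
                        barrier.await();     // then waits for the slower elements
                    }
                } catch (InterruptedException | BrokenBarrierException ex) {
                    Thread.currentThread().interrupt();
                }
            }).start();
        }
    }

    private static void simulate(int id, int tick) {
        System.out.println("element " + id + " finished tick " + tick);
    }
}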
Threads are meant to run completely independent of each other, which means synchronizing them in any way is always a pain. In your case, you need a central "clock" because there is no way to tell the VM that each thread should get the same amount of ... uh ... what should it get? The same amount of RAM? Probably doesn't matter. The same amount of CPU? Are all your objects so similar that each needs the same number of assembler instructions?
So my suggestion is to use a central clock which broadcasts clock ticks to every process. All threads within each process read the ticks (which should be absolute), calculate the difference to the last tick they saw and then update their internal model accordingly.
When a thread is done updating, it must put itself to sleep; waiting for the next tick. In Java, use wait() on the "tick received" lock and wake all threads with "notifyAll()".
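A small sketch of such a central clock, assuming the tick counter itself is the shared state the threads wait on:

public class ClockDemo {

    private final Object tickLock = new Object();
    private long currentTick = 0;

    // Called by the central clock thread to broadcast a new tick.
    void tick() {
        synchronized (tickLock) {
            currentTick++;
            tickLock.notifyAll();
        }
    }

    // Called by each simulation thread after it finishes updating its model.
    long awaitNextTick(long lastSeenTick) throws InterruptedException {
        synchronized (tickLock) {
            while (currentTick <= lastSeenTick) { // guards against spurious wakeups
                tickLock.wait();
            }
            return currentTick;
        }
    }

    public static void main(String[] args) throws InterruptedException {
        ClockDemo clock = new ClockDemo();

        // One worker per simulated object (kept tiny here).
        for (int i = 0; i < 3; i++) {
            final int id = i;
            new Thread(() -> {
                long seen = 0;
                try {
                    while (seen < 5) {
                        seen = clock.awaitNextTick(seen);
                        System.out.println("object " + id + " updated to tick " + seen);
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }).start();
        }

        // The clock broadcasts five ticks.
        for (int t = 0; t < 5; t++) {
            Thread.sleep(100);
            clock.tick();
        }
    }
}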
I'd recommend not using threads wherever possible because they just add problems later if you're not careful. When doing physics simulations you could use hundreds of thousands of discrete objects for larger simulations. You can't possibly create this many threads on any OS that I know of, and even if you could it would perform like shit!
In your case you could create a number of threads, and put an event loop in each thread. A 'master' thread could sequence the execution and post a 'process' event to each worker thread to wake it up and make it do some work. In that way the threads will sleep until you tell them to work.
You should be able to get the master thread to tick at a rate that allows all your worker threads to complete before the next tick.
I don't think threads are the answer to your problem, with the exception of parallelising into a small number of worker threads (equal to the number of cores in the machine) which each linearly sequence a series of physical objects. You could still use the master/event-driven approach this way, but you would remove a lot of the overhead.
Please don't. Threads are an O/S abstraction permitting the appearance of parallel execution. With multiple and multicore CPU's, the O/S can (but need not) distribute threads among the different cores.
The closest thing to your scalability vision which I see as workable is to use worker threads, dimensioned to roughly match the number of cores you have, and distribute work among them. A rough draft: define a class ActionTick which does the updating for one particle, and let the worker threads pick ActionTicks to process from a shared queue. I see several challenges even with such a solution.
Threading overheads: you get context switching overhead among different worker threads. Threads by themselves are expensive (if not actually as ruinous as processes): test performance with different thread pool sizes. Adding more threads beyond the number of cores tends to reduce performance!
Synchronization costs: you get several spots of contention: access to the work queue for one, but worse, access to the simulated world. You need to delimit the effects of each ActionTick or implement a lot of locking/unlocking.
Difficulty of optimizing the physics. You want to limit the number of objects/particles each ActionTick looks at (distance cut-off? 3D-tree subdivision of the simulation space?). Depending on the simulation domain, you may be able to eliminate a lot of work by examining whether any change is even needed in a subset of items. Doing these kinds of optimizations is easier before queueing work items, rather than as a distributed algorithm. But then that part of your simulation becomes a potential scalability bottleneck.
Complexity. Threading and concurrency introduces several cans of worms to a solution. Always consider other options first -- but if you need them, try threads before creating your own work item scheduling, locking and execution strategies...
Caveat: I haven't worked with any massive simulation software, just some hobbyist code.
As you mention, there are many "DON'T DO THIS" answers. Most seem to read threads as OS threads used by Java. Since you mentioned Erlang in your post, I'd like to post a more Erlang-centered answer.
Modeling this kind of simulation with processes (or actors, micro threads, green threads, as they are sometimes called) doesn't necessarily need any synchronization. In essence, we have a couple of (most likely thousands or hundreds of thousands) physics objects that need to be simulated. We want to simulate these objects as realistically as possible, but there is probably also some kind of real time aspect involved (doesn't have to be though, you don't mention this in your question).
A simple solution would be to spawn an Erlang process for each object, send ticks to all of them and collect the results of the simulation before proceeding with the next tick. This is, in practice, synchronizing everything. It is of course more of a deterministic solution and does not guarantee any real-time properties. It is also non-trivial how the processes would talk to each other to get the data they need for the calculations. You probably need to group them in clever ways (collision groups etc.), have hibernated processes (which Erlang has neat support for) for sleeping objects, etc. to speed things up.
To get real time properties you probably need to restrain the calculations performed by the processes (trading accuracy for speed). This could perhaps be done by sending out ticks without waiting for answers, and letting the object processes reply back to each tick with their current position and other data you need (even though it might only be approximated at the time). As DJClayworth says, this could lead to errors accumulating in the simulation.
I guess in one sense, the question is really about if it is possible to use the strength of concurrency to gain some kind of advantage here. If you need synchronization, it is a quite strong sign that you do not need concurrency between each physics object. Because you essentially throw away a lot of computation time by waiting for other processes. You might use concurrency during calculation but that is another discussion, I think.
Note: none of these ideas take the actual physics calculations into account. This is not Erlang's strong side and could perhaps be performed in a C library or whatever strikes your fancy, depending on the type of characteristics you want.
Note: I do not know of any case where this has been done (especially not by me), so I cannot guarantee that this is sound advice.
Even with perfect software, hardware will prevent you doing this. Hardware threads typically don't have fair performance. Over a short period, you are lucky if threads run within +-10% performance.
There are, of course, outliers. Some chipsets will run some cores in power-saving mode and others not. I believe one of the Blue Gene research machines had software-controlled scheduling of hardware threads instead of locks.
Erlang will by default try and spread its processes evenly over the available threads. It will also by default try to run threads on all available processors. So if you have enough runnable Erlang processes then you will get a relatively even balance.
I'm not a threading expert, but isn't the whole point of threads that they are independent from each other - and non-deterministic?
I think you have a fundamental misconception in your question where you say:
It would be conceptually very close to how the real world works
The real world does not work in a thread-like way at all. Threads in most machines are not independent and not actually even simultaneous (the OS will use context-switching instead). They provide the most value when there is a lot of IO or waiting occurring.
Most importantly, the real-world does not "consume more resources" as more complex things happen. Think of the difference between two objects falling from a height, one falling smoothly and the other performing some kind of complex tumbling motion...
I would make a kind of "clock generator" - and would register every new object/thread there. The clock will notify all registered objects when the delta-t has passed.
However this does not mean you need a separate thread for every object. Ideally you will have as many threads as processors.
From a design point of view, you could separate the execution of the object tasks through an Executor or a thread pool, e.g. when an object receives the tick event, it goes to a thread pool and schedules itself for execution.
Two things have to happen in order to achieve this. You have to ensure that you have an equal number of threads per CPU core, and you need some kind of synchronization.
That sync can be rather simple, like checking a "cycle-done" variable for each thread while performing the computation, but you can't avoid it.
Working on motor control, I have used some math to maintain velocity in a stable state.
The system has PID control: proportional, integral and derivative. But this is an analog/digital system. Maybe you can use something similar to determine how much time each thread must run, but the biggest tip I can give you is that all the threads will each need clock synchronization.
I'm the first to admit I'm not a threading expert, but this sounds like a very wrong way to approach simulation. As others have already commented, having too many threads is computationally expensive. Furthermore, if you are planning to do what I think you are thinking of doing, your simulation may turn out to produce random results (which may not matter if you are making a game).
I'd go with a few worker threads used to calculate discrete steps of the simulation.
