I had this brilliant idea to speed up the time needed for generating 36 files: use 36 threads! Unfortunately, if I start one connection (one j2ssh connection object) with 36 threads/sessions, everything lags far more than if I run the threads one at a time.
If I instead create 36 new connections (36 j2ssh connection objects), so that each thread has its own connection to the server, I get an out-of-memory exception (somehow the program still runs and finishes its work, but slower than running one thread after another).
So what should I do? How do I find the optimal number of threads to use?
Thread.activeCount() returns 3 before I start my 36 threads. I'm using a Lenovo laptop with an Intel Core i5.
You could narrow it down to a more reasonable number of threads with an ExecutorService. You probably want to use something near the number of processor cores available, e.g.:
int threads = Runtime.getRuntime().availableProcessors();
ExecutorService service = Executors.newFixedThreadPool(threads);
for (int i = 0; i < 36; i++) {
    service.execute(new Runnable() {
        public void run() {
            // do what you need per file here
        }
    });
}
service.shutdown();
A good practice is to spawn threads equal to the number of cores in your processor. I normally use an Executors.newFixedThreadPool(numOfCores) executor service and keep feeding it jobs from my job queue. Simple. :-)
Your Intel i5 has two cores; hyperthreading makes them look like four. So you only get four cores' worth of parallelization; the rest of your threads are time sliced.
Assume 1 MB of RAM per thread just for thread creation, then add the memory each thread requires to process its file; with 36 threads that is already 36 MB of stack space before any file data. That should give you an idea of why you're getting out-of-memory errors. How big are the files you're dealing with? You can see that you'll have a problem if they're very large and held in memory at the same time.
I'll assume that the server receiving the files can accept multiple connections, so there's value in trying this.
I'd benchmark with 1 thread and then increase them until I found that the performance curve was flattening out.
Brute force: profile incrementally. Increase the number of threads gradually and check the performance. Since the number of connections is just 36, it should be easy.
You need to understand that if you create 36 threads, you still have only one or two processors, and they would be switching between threads most of the time.
I would say you increase the thread count a little at a time, say to 6, observe the behavior, and go from there.
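A minimal sketch of that incremental benchmark; the task body and the counts are placeholders for whatever each thread actually does:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ThreadCountBenchmark {
    public static void main(String[] args) throws InterruptedException {
        // Try 1, 2, 4, ... threads and print the wall-clock time for each run.
        for (int threads = 1; threads <= 36; threads *= 2) {
            long start = System.nanoTime();
            ExecutorService service = Executors.newFixedThreadPool(threads);
            for (int i = 0; i < 36; i++) {
                service.execute(ThreadCountBenchmark::processOneFile);
            }
            service.shutdown();
            service.awaitTermination(1, TimeUnit.HOURS);
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            System.out.println(threads + " threads: " + elapsedMs + " ms");
        }
    }

    private static void processOneFile() {
        // placeholder: generate/transfer one file here
    }
}

Stop increasing the count once the printed times flatten out or start climbing.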
One way to tune the number of threads to the size of the machine is to use:
int processors = Runtime.getRuntime().availableProcessors();
int threads = processors * N; // N could be 1, 2 or more, depending on what you are doing
ExecutorService es = Executors.newFixedThreadPool(threads);
First you have to find out where the bottleneck is.
If it is the SSH connection, it usually does not help to open multiple connections in parallel. Better to use multiple channels on one connection if needed (see the sketch below).
If it is the disk I/O, creating multiple threads writing (or reading) only helps if they are accessing different disks (which is seldom the case). But you could have another thread doing CPU-bound work while one thread waits on its disk I/O.
If it is the CPU, and you have enough idle cores, more threads can help. Even more so if they don't need to access common data. But still, more threads than cores (plus some threads doing I/O) does not help. (Also keep in mind that there are usually other processes on your server, too.)
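To illustrate the one-connection, many-channels approach, here is a sketch using JSch as the SSH library (the question uses j2ssh, whose API differs; the host, credentials, and command are made up):

import com.jcraft.jsch.ChannelExec;
import com.jcraft.jsch.JSch;
import com.jcraft.jsch.Session;

public class MultiChannelSketch {
    public static void main(String[] args) throws Exception {
        JSch jsch = new JSch();
        Session session = jsch.getSession("user", "example.com", 22);
        session.setPassword("secret");
        session.setConfig("StrictHostKeyChecking", "no"); // demo only
        session.connect();

        // One TCP/SSH connection, several channels (sessions) on top of it.
        for (int i = 0; i < 4; i++) {
            ChannelExec channel = (ChannelExec) session.openChannel("exec");
            channel.setCommand("generate-file " + i); // hypothetical remote command
            channel.connect();
            // ... read the channel's streams here, then:
            channel.disconnect();
        }
        session.disconnect();
    }
}

The point is that the expensive handshake and authentication happen once per connection, while each channel is cheap.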
Using more threads than the number of cores on your machine will only slow down the whole process. Adding threads speeds things up only until you reach that number.
Be sure you don't create more threads than you have processing units, or you are likely to create more overhead from context switching than you gain in concurrency. Also remember that you have only one HDD and one HDD controller; as a result, I doubt multithreading is going to help you at all here.
Related
I've written a simple multithreaded Java application. The main method just creates 5k threads; each thread loops over a list of 5M records to process.
My Machine specs:
CPU cores: 12 cores
Memory: 13 GB RAM
OS: Debian 64-bit
My jar is now running, and I use htop to monitor my application while it runs.
And this is how I construct a thread:
ExecutorService executor = Executors.newCachedThreadPool();
Future<MatchResult> future = executor.submit(() -> {
    Match match = new Match();
    return match.find(this);
});
Match.class
MatchResult find(Main main) {
    // loop over the list of 5M records
    // process the values and do some calculations
    // send the result back to the caller
    // this method itself is fine; it just takes a long time to run (~160 min)
}
And now I have some questions:
1- Based on my understanding, if I have a multithreaded process it should fully utilize all my cores until the task is completed, so why is the load only around 0.5 (only half a core is used)?
2- Why is my Java app's state "S" (sleeping) while it is actually running and filling up the log file?
3- Why can I see only 2037 of the 5k threads running? (This number was lower earlier and increases over time.)
My target: to utilize all cores and get all these 5k+ tasks done as fast as possible :)
Based on my understanding, if I have a multithreaded process it should fully utilize all my cores until the task is completed.
Your understanding is not correct. There are lots of reasons why cores may not (all) be used in a poorly designed multi-threaded application.
so why is the load only around 0.5 (only half a core is used)?
A number of possible reasons:
The threads may be deadlocked.
The threads may all be contending for a single lock (or a small number of locks), resulting in most of them waiting.
The threads could all be waiting for I/O; e.g. reading the records from some database.
And those are just some of the more obvious possible reasons.
Given that your threads are making some progress, I think explanation #2 is a good fit for your "symptoms".
For what it is worth, creating 5k threads is almost certainly a really bad idea. At most 12 of them could possibly be running at any time. The rest will be waiting to run (assuming you resolve the problem that is leading to thread starvation) while tying down memory. The latter has various secondary performance effects.
My target: to utilize all cores and get all these 5k+ tasks done as fast as possible :)
Those two goals are probably mutually exclusive :-)
All threads are logging to the same file via java.util.Logger.
That is possibly leading to them all contending for the same lock on something in the logger framework, or bottlenecking on file I/O for the log file.
Generally speaking, logging is expensive. If you want performance, minimize your logging, and for the cases where logging is essential, use a logging framework that doesn't introduce a concurrency bottleneck.
The best way to solve this problem is to profile the code and figure out where it is spending most of its time.
Guesswork is inefficient.
Thank you guys, I've fixed the problem and now I have all 12 cores running at maximum. :)
I ran jstack <pid> to see the status of all the running threads in the process, and I found that 95% of my threads were actually BLOCKED at the logging line. I did some googling and found that I can use an AsyncAppender in log4j so that logging will not block the worker thread.
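The same decoupling can be hand-rolled to see why it helps: workers push log lines onto a queue and a single background thread owns the file, so workers never block on file I/O. A minimal sketch (class and file names are made up):

import java.io.FileWriter;
import java.io.PrintWriter;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class AsyncLogger {
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();

    public AsyncLogger(String file) {
        Thread writer = new Thread(() -> {
            try (PrintWriter out = new PrintWriter(new FileWriter(file, true))) {
                while (true) {
                    out.println(queue.take()); // blocks only this one thread
                }
            } catch (Exception e) {
                // writer dies silently here; real frameworks handle this better
            }
        });
        writer.setDaemon(true);
        writer.start();
    }

    // Called by the 5k worker threads; never touches the file directly.
    public void log(String line) {
        queue.offer(line);
    }
}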
I am writing a utility that must make thousands of network requests. Each request receives only a single, small packet in response (similar to ping), but may take upwards of several seconds to complete. Processing each response completes in one (simple) line of code.
The net effect of this is that the computer is not IO-bound, file-system-bound, or CPU-bound, it is only bound by the latency of the responses.
This is similar to, but not the same as, "There is a way to determine the ideal number of threads?" and "Java best way to determine the optimal number of threads [duplicate]"... the primary difference is that I am only bound by latency.
I am using an ExecutorService object to run the threads and a Queue<Future<Integer>> to track threads that need to have results retrieved:
ExecutorService executorService = Executors.newFixedThreadPool(threadPoolSize);
Queue<Future<Integer>> futures = new LinkedList<Future<Integer>>();

for (int quad3 = 0; quad3 < 256; ++quad3) {
    for (int quad4 = 0; quad4 < 256; ++quad4) {
        byte[] quads = { quad1, quad2, (byte)quad3, (byte)quad4 };
        futures.add(executorService.submit(new RetrieverCallable(quads)));
    }
}
... I then dequeue all the elements in the queue and put the results in the required data structure:
int[] results = new int[65536];
int i = 0;
while (!futures.isEmpty()) {
    try {
        results[i] = futures.remove().get();
    } catch (Exception e) {
        results[i] = -1;
    }
    i++;
}
My first question is: is this a reasonable way to track all the threads? If thread X takes a while to complete, many other threads might finish before X does. Will the thread pool exhaust itself waiting for open slots, or will the ExecutorService object manage the pool in such a way that threads that have completed but not yet been processed are moved out of the available slots, so that other threads may begin?
My second question is what guidelines can I use for finding the optimal number of threads to make these calls? I don't even know order-of-magnitude guidance here. I know it works pretty well with 256 threads, but seems to take roughly the same overall time with 1024 threads. CPU utilization is hovering around 5%, so that doesn't appear to be an issue. With that large a number of threads, what are all the metrics I should be looking at to compare different numbers? Obviously overall time to process the batch, average time per thread... what else? Is memory an issue here?
It may shock you, but you do not need any threads for I/O (quantitatively, this means 0 threads). It is good that you have learned that multithreading does not multiply your network bandwidth. Now it is time to learn that threads do computation; they do not do the (high-latency) communication. The communication is performed by the network adapter, which runs genuinely in parallel with the CPU. It is wasteful to allocate a thread (with all the resources another answer here lists, while claiming that you need one thread per request) just to sleep until the network adapter finishes its job. You need no threads for I/O = you need 0 threads.
It makes sense to allocate threads for computation to run in parallel with the I/O request(s). The number of threads depends on the computation-to-communication ratio and is limited by the number of cores in your CPU.
Sorry, I had to say this: even though you have clearly committed to blocking I/O, many people do not understand this basic point. Take the advice, use asynchronous I/O, and you'll see that the issue does not exist.
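For what that looks like in practice, here is a sketch using Java's NIO.2 asynchronous channels (the host, port, and single-packet protocol are stand-ins for the actual probe):

import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.AsynchronousSocketChannel;
import java.nio.channels.CompletionHandler;

public class AsyncProbe {
    public static void probe(String host, int port) throws Exception {
        AsynchronousSocketChannel channel = AsynchronousSocketChannel.open();
        channel.connect(new InetSocketAddress(host, port), channel,
            new CompletionHandler<Void, AsynchronousSocketChannel>() {
                public void completed(Void v, AsynchronousSocketChannel ch) {
                    ByteBuffer buf = ByteBuffer.allocate(64); // one small packet
                    ch.read(buf, buf, new CompletionHandler<Integer, ByteBuffer>() {
                        public void completed(Integer bytes, ByteBuffer b) {
                            // the one-line processing of the response goes here
                        }
                        public void failed(Throwable t, ByteBuffer b) { /* record -1 */ }
                    });
                }
                public void failed(Throwable t, AsynchronousSocketChannel ch) { /* record -1 */ }
            });
        // no thread blocks here; thousands of probes can be in flight at once
    }
}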
As mentioned in one of the linked answers you refer to, Brian Goetz has covered this well in his article.
He seems to imply that in your situation you would be advised to gather metrics before committing to a thread count.
Tuning the pool size
Tuning the size of a thread pool is largely a matter of avoiding two mistakes: having too few threads or too many threads. ...
The optimum size of a thread pool depends on the number of processors available and the nature of the tasks on the work queue. ...
For tasks that may wait for I/O to complete -- for example, a task that reads an HTTP request from a socket -- you will want to increase the pool size beyond the number of available processors, because not all threads will be working at all times. Using profiling, you can estimate the ratio of waiting time (WT) to service time (ST) for a typical request. If we call this ratio WT/ST, for an N-processor system, you'll want to have approximately N*(1+WT/ST) threads to keep the processors fully utilized.
My emphasis.
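To make the formula concrete, here is a small fragment plugging in hypothetical profiling numbers (the 900 ms wait / 100 ms service split is an assumption for illustration):

// N * (1 + WT/ST), with made-up profiling numbers:
// each task waits ~900 ms on the network and uses ~100 ms of CPU.
int n = Runtime.getRuntime().availableProcessors();
double waitTime = 900.0;     // WT, measured by profiling (assumed here)
double serviceTime = 100.0;  // ST, measured by profiling (assumed here)
int poolSize = (int) (n * (1 + waitTime / serviceTime));
ExecutorService pool = Executors.newFixedThreadPool(poolSize);

On a 4-core machine this yields a pool of 40 threads, which matches the intuition that latency-bound tasks need far more threads than cores.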
Have you considered using Actors?
Best practices:
Actors should be like nice co-workers: do their job efficiently without bothering everyone else needlessly and avoid hogging resources. Translated to programming, this means to process events and generate responses (or more requests) in an event-driven manner. Actors should not block (i.e. passively wait while occupying a Thread) on some external entity (which might be a lock, a network socket, etc.) unless it is unavoidable; in the latter case see below.
Sorry, I can't elaborate further, because I haven't used this much.
UPDATE
Answer in Good use case for Akka might be helpful.
Scala: Why are Actors lightweight?
Pretty sure that in the described circumstances the optimal number of threads is 1. In fact, that is surprisingly often the answer to any question of the form "how many threads should I use?"
Each additional thread adds extra overhead in terms of stack (and associated GC roots), context switching, and locking. This may or may not be measurable: the effort to meaningfully measure it in all target environments is non-trivial. In return, there is little scope for any benefit, since processing is neither CPU- nor I/O-bound.
So less is always better, if only for reasons of risk reduction. And you can't have fewer than 1.
I assume the desired optimization is the time to process all requests. You said the number of requests is "thousands". Evidently, the fastest way is to issue all requests at once, but this may overflow the network layer. You should determine how many simultaneous connections the network layer can bear, and make this number a parameter of your program.
Then, spending a thread on each request requires a lot of memory. You can avoid this by using non-blocking sockets. In Java, there are two options: NIO1 with selectors, and NIO2 with asynchronous channels. NIO1 is complex, so you'd better find a ready-made library and reuse it. NIO2 is simple but available only since JDK 1.7.
Processing the responses should be done on a thread pool. I don't think the number of threads in the pool greatly affects overall performance in your case. Just tune the pool size from 1 up to the number of available processors.
In our high-performance systems, we use the actor model as described by @Andrey Chaschev.
The optimal number of threads in your actor model differs with your CPU structure and how many processes (JVMs) you run per box. Our findings:
If you have 1 process only, use total CPU cores - 2.
If you have multiple processes, check your CPU structure. We found it's good to have the number of threads equal the number of cores in a single CPU: e.g., if you have a 4-CPU server, each CPU having 4 cores, then 4 threads per JVM gives the best performance. After that, always leave at least 1 core to your OS.
A partial answer, but I hope it helps. Yes, memory can be an issue: Java reserves 1 MB of thread stack by default (at least on Linux amd64). So with a few GB of RAM in your box, that limits your thread count to a few thousand.
You can tune this with a flag like -XX:ThreadStackSize=64. That would give you 64 kB, which is plenty in most situations.
You could also move away from threading entirely and use epoll to respond to incoming responses. This is far more scalable but I have no practical experience with doing this in Java.
I am extracting lines matching a pattern from log files, so I allotted each log file to a Runnable object that writes the matching lines to a result file (with well-synchronized writer methods).
Important snippet under discussion :
ExecutorService executor = Executors.newFixedThreadPool(NUM_THREAD);
for (File eachLogFile : hundredsOfLogFilesArrayObject) {
    executor.execute(new RunnableSlavePatternMatcher(eachLogFile));
}
Important criteria:
The number of log files could be very few, like 20, or could cross 1000 for some users. I recorded a series of tests in an Excel sheet, and I am really concerned about the red-marked results. 1. I assumed that if the number of threads created equals the number of files to be processed, then the processing time would be less than in the case where the number of threads is smaller than the number of files, which didn't happen. (Please correct me if my understanding is wrong.)
Result:
I would like to identify a value for NUM_THREAD which is efficient for a small number of files as well as for 1000s of files.
Please suggest answers for questions 1 and 2.
Thanks!
Chandru
You just found that your program is not CPU-bound but (likely) I/O-bound.
This means that beyond about 10 threads the OS can't keep up with the requested reads of all the threads that want their data, and more threads end up waiting for the next block of data at a time.
Also, because writing the output is synchronized across all threads, that may even be the biggest bottleneck in your program (a producer-consumer solution may be the answer here, to minimize the time threads spend waiting to write output; see the sketch below).
The optimal number of threads depends on how fast you can read the files (the faster you can read, the more threads are useful).
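A sketch of that producer-consumer arrangement, with a single writer thread owning the result file (the class and names are illustrative, not from the question):

import java.io.PrintWriter;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class ResultWriter {
    private static final String POISON = "\u0000EOF"; // sentinel to stop the writer
    private final BlockingQueue<String> matchedLines = new LinkedBlockingQueue<>();

    // Matcher threads call this instead of writing to the file themselves.
    public void publish(String line) throws InterruptedException {
        matchedLines.put(line);
    }

    public void finish() throws InterruptedException {
        matchedLines.put(POISON);
    }

    // A single consumer owns the file, so there is no lock contention on writes.
    public void runWriter(String resultFile) throws Exception {
        try (PrintWriter out = new PrintWriter(resultFile)) {
            String line;
            while (!(line = matchedLines.take()).equals(POISON)) {
                out.println(line);
            }
        }
    }
}

The matcher threads then call publish() instead of sharing a synchronized writer, so they never block on the result file.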
It appears that 2 threads is enough to use all your processing power. Most likely you have two cores with hyper-threading.
Mine is an Intel i5 2.4 GHz, 4 CPUs, 8 GB RAM. Is this detail helpful?
Depending on the model, this has 2 cores and hyper-threading.
I assume that if the number of threads created is equal to the number of files to be processed then the processing time would be less,
This will maximise the overhead, but won't give you more cores than you already have.
When parallelizing, using many more threads than your available CPU cores will usually increase the overall time. Your system will spend overhead time switching from thread to thread on one CPU core instead of executing the tasks one after another.
If you have 8 cpu cores on your computer, you might observe some improvement using 8/9/10 threads instead of using only 1 while using 20+ threads will actually be less efficient.
One problem is that I/O doesn't parallelize well, especially if you have a non-SSD, since sequential reads (what happens when one thread reads a file) are much faster than random reads (when the read head has to jump around between different files read by several threads). I would guess you could speed up the program by reading the files from the thread sending the jobs to the executor:
for (File file : hundredsOfLogFilesArrayObject) {
    byte[] fileContents = readContentsOfFile(file);
    executor.execute(new RunnableSlavePatternMatcher(fileContents));
}
As for the optimal thread count, that depends.
If your app is I/O bound (which is quite possible if you're not doing extremely heavy processing of the contents), a single worker thread which can process the file contents while the original thread reads the next file will probably suffice.
If you're CPU bound, you probably don't want many more threads than you've got cores:
ExecutorService executor = Executors.newFixedThreadPool(
        Runtime.getRuntime().availableProcessors());
Although, if your threads get suspended a lot (waiting for synchronization locks, or something), you may get better results with more threads. Or if you've got other CPU-munching activities going on, you may want fewer threads.
You can try using a cached thread pool.
public static ExecutorService newCachedThreadPool()
Creates a thread pool that creates new threads as needed, but will reuse previously constructed threads when they are available. These pools will typically improve the performance of programs that execute many short-lived asynchronous tasks. Calls to execute will reuse previously constructed threads if available.
You can read more here
Just wondering: what is the best way to decide when to stop creating new threads on a single-core machine that runs the same program multiple times as threads?
The threads are fetching web content and doing a bit of processing, which means the load of each thread is not constant all the way until the thread terminates.
I'm thinking of having a thread that monitors the CPU/RAM load and stops creating threads if the load reaches a certain threshold, but also stops creating threads once a certain thread count has been reached, to make sure the CPU doesn't get overloaded.
Any feedback on what techniques are out there to achieve this?
Many thanks,
Vladimir
It is going to be difficult to do this by monitoring the CPU used by the current process. Those numbers tend to lag reality and the result is going to be peaks and valleys to a large degree. The problem is that your threads are mostly going to be blocked by IO and there is not any good way to anticipate when bytes will be available to be read in the near future.
That said, you could start out with a ThreadPoolExecutor at a certain max thread number (for a single processor let's say 4) and then check every 10 seconds or so the load average. If the load average is below what you want then you could call setMaximumPoolSize(...) with a larger value to increase it for the next 10 seconds. You may need to poll 30 or more seconds between each calculation to smooth out the performance of your application.
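A sketch of that polling approach (the 0.75 load threshold, the 10-second interval, and the 64-thread cap are arbitrary; getSystemLoadAverage() returns -1 on platforms where it is unavailable):

import java.lang.management.ManagementFactory;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class LoadAwarePool {
    public static void main(String[] args) {
        // Start small; for a "fixed" pool the core size is the actual size.
        ThreadPoolExecutor pool = (ThreadPoolExecutor) Executors.newFixedThreadPool(4);

        ScheduledExecutorService monitor = Executors.newSingleThreadScheduledExecutor();
        monitor.scheduleAtFixedRate(() -> {
            double load = ManagementFactory.getOperatingSystemMXBean().getSystemLoadAverage();
            int size = pool.getCorePoolSize();
            if (load >= 0 && load < 0.75 && size < 64) {
                pool.setMaximumPoolSize(size + 2); // max must grow before core
                pool.setCorePoolSize(size + 2);
            }
        }, 10, 10, TimeUnit.SECONDS);

        // ... submit spider tasks to pool ...
    }
}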
You could use the following code to track the total CPU time across all threads; not sure if that's the best way to do it:
ThreadMXBean threadMxBean = ManagementFactory.getThreadMXBean();
long total = 0;
for (long id : threadMxBean.getAllThreadIds()) {
    long cpuTime = threadMxBean.getThreadCpuTime(id);
    if (cpuTime > 0) {
        total += cpuTime;
    }
}
// getThreadCpuTime() reports nanoseconds
long currentCpuMillis = total / 1000000;
Instead of trying to maximize the CPU level for your spider, you might consider trying to maximize throughput. Sample the number of pages spidered per unit of time and increase or decrease the max number of threads in your ExecutorService until this is maximized.
One thing to consider is to use NIO and selectors so your threads are always busy as opposed to always waiting for IO. Here's a good example tutorial about NIO/Selectors. You might also consider using Pyronet which seems to provide some good features around NIO.
If async I/O is not a good fit, I would consider using thread pools, e.g. ThreadPoolExecutor, so you don't have the overhead of creating, destroying and recreating threads.
Then I would do performance testing to find the max number of threads that offers the best performance.
You could start with 10 threads, then rerun your performance test with 20 threads, and so on, until you home in on an optimal value. At the same time I would use system tools (depending on your OS) to monitor the thread run queue, the JVM, etc.
For the performance test you would have to ensure that your test is repeatable (i.e. using the same inputs) and representative of the actual input that your program would be using.
What is the rough "cost" of using threads in Java? Are there any rules of thumb or empirical values for how much memory the creation of one thread costs? Is there a rough estimate of how many CPU cycles it takes to create a thread?
Context: In a servlet of a web application I want to parallelize the content creation, as parts of the content are file-based, database-based, and webservice-based. But this would mean that for every "http-request thread" (of my servlet container) I will have two to four additional threads. Note that I will be using the ExecutorService in Java 6.
What should I expect when I use hundreds to thousands of Java threads on a web-server?
Each thread has its own stack, and consequently there's an immediate memory impact. The default thread stack size is, IIRC, 512k for Java 6 (different JVMs/versions will possibly have different defaults). This figure is adjustable using the -Xss option. Consequently, using hundreds of threads will have an impact on the memory the VM consumes (quite possibly before any CPU impact, unless those threads are running).
I've seen clients run into problems related to threads/memory, since it's not an obvious link. It's trivial to create 100,000 threads (using executors/pools etc.) and memory problems don't appear to be immediately attributable to this.
If you're servicing many clients, you may want to take a look at the Java NIO API and in particular multiplexing, which allows asynchronous network programming. That will permit you to handle many clients with only one thread, and consequently reduce your requirement for a huge number of threads.
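For reference, the multiplexing pattern looks roughly like this (an accept-and-read skeleton; the port and buffer size are arbitrary):

import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;
import java.util.Iterator;

public class MultiplexingServer {
    public static void main(String[] args) throws Exception {
        Selector selector = Selector.open();
        ServerSocketChannel server = ServerSocketChannel.open();
        server.bind(new InetSocketAddress(8080));
        server.configureBlocking(false);
        server.register(selector, SelectionKey.OP_ACCEPT);

        while (true) {                       // one thread services every client
            selector.select();
            Iterator<SelectionKey> it = selector.selectedKeys().iterator();
            while (it.hasNext()) {
                SelectionKey key = it.next();
                it.remove();
                if (key.isAcceptable()) {
                    SocketChannel client = server.accept();
                    client.configureBlocking(false);
                    client.register(selector, SelectionKey.OP_READ);
                } else if (key.isReadable()) {
                    SocketChannel client = (SocketChannel) key.channel();
                    ByteBuffer buf = ByteBuffer.allocate(1024);
                    if (client.read(buf) < 0) {
                        client.close();      // peer closed the connection
                    }
                    // handle whatever was read into buf here
                }
            }
        }
    }
}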
That depends: It depends on the OS, the Java version, and the CPU. The only way to figure this out is to try it and measure the results.
Since you'll be using the ExecutorService, it will be simple to control the number of threads. Don't use too few or requests will stack up. If you use too many, you'll run into performance problems with your file system and the DB long before Java runs out of threads.
During preparation of a magazine article about fibers (aka Project Loom) I ran some simple tests (Windows 10, JDK-Loom 15.b3):
AtomicInteger counter = new AtomicInteger(T); // T = the number of threads, 10000 here
AtomicBoolean go = new AtomicBoolean(false);

for (int i = 0; i < 10000; i++) {
    Thread.newThread(Thread.VIRTUAL, () -> { // <-- remove Thread.VIRTUAL for plain threads
        try {
            while (!go.get()) Thread.sleep(1);
        } catch (InterruptedException e) {
            return;
        }
        counter.decrementAndGet();
    }).start();
}
My Windows desktop (i7-8700K) needs about 400000 ms to create all 10000 threads and an additional 200 ms to run the counter down.
Surprisingly, I could not confirm the memory consumption of 512k per thread (1 MB according to some other sources). The Windows memory monitor shows additional memory consumption of only about 500 MB for all 10000 threads (50k per thread).
Project Loom's fibers manage to run the test in 30 and 50 ms respectively, and show no measurable memory consumption.
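For readers on a released JDK (21 or later): the API shipped under different names, so a rough equivalent of the test above looks like this (the go/counter handshake is reconstructed from the snippet):

import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

public class VirtualThreadTest {
    public static void main(String[] args) throws InterruptedException {
        int t = 10_000;
        AtomicInteger counter = new AtomicInteger(t);
        AtomicBoolean go = new AtomicBoolean(false);

        for (int i = 0; i < t; i++) {
            Thread.ofVirtual().start(() -> { // Thread.ofPlatform() for OS threads
                try {
                    while (!go.get()) Thread.sleep(1);
                } catch (InterruptedException e) {
                    return;
                }
                counter.decrementAndGet();
            });
        }
        go.set(true);                        // release all threads
        while (counter.get() > 0) Thread.sleep(1);
    }
}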