Slowing process creation under Java?

I have a single JVM [1] with a large heap (up to 240GB, though in the 20-40GB range for most of this phase of execution) running under Linux [2] on a server with 24 cores. We have tens of thousands of objects that have to be processed by an external executable, after which we load the data created by those executables back into the JVM. Each executable produces about half a megabyte of data on disk, which is of course larger once it has been read back in after the process finishes.
Our first implementation was to have each executable handle only a single object. This involved spawning twice as many processes as we had objects (since we called a shell script that in turn called the executable). Our CPU utilization would start off high, though not necessarily at 100%, and slowly worsen. As we began measuring to see what was happening, we noticed that the process creation time [3] slows continually: it starts at sub-second times but eventually grows to take a minute or more. The actual processing done by the executable usually takes less than 10 seconds.
Next we changed the executable to take a list of objects to process, in an attempt to reduce the number of processes created. With batch sizes of a few hundred (~1% of our current sample size), the process creation times start out around 2 seconds and grow to around 5-6 seconds.
Basically, why is it taking so long to create these processes as execution continues?
[1] Oracle JDK 1.6.0_22
[2] Red Hat Enterprise Linux Advanced Platform 5.3, Linux kernel 2.6.18-194.26.1.el5 #1 SMP
[3] Creation of the ProcessBuilder object, redirecting the error stream, and starting it.
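For reference, a rough sketch of the spawn step being timed in [3]; the executable path and argument here are placeholders, not our real command:
import java.io.IOException;

public class SpawnTiming {
    public static void main(String[] args) throws IOException, InterruptedException {
        long t0 = System.nanoTime();
        ProcessBuilder pb = new ProcessBuilder("/usr/local/bin/process", "object-id"); // placeholder command
        pb.redirectErrorStream(true);   // merge stderr into stdout, as in [3]
        Process p = pb.start();         // the call whose latency grows over time
        long elapsedMs = (System.nanoTime() - t0) / 1000000L;
        System.out.println("spawn took " + elapsedMs + " ms");
        p.waitFor();
    }
}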

My guess is that you MIGHT be running into problems with fork/exec, if Java is using the fork/exec system calls to spawn subprocesses.
Normally fork/exec is fairly efficient, because fork() does very little - all pages are copy-on-write. This stops being so true with very large processes (i.e. those with gigabytes of pages mapped) because the page tables themselves take a relatively long time to create - and of course, destroy, as you immediately call exec.
As you're using a huge amount of heap, this might be affecting you. The more pages you have mapped in, the worse it may become, which could be what's causing the progressive slowdown.
Consider either:
Using posix_spawn, if that is NOT implemented by fork/exec in libc
Using a single subprocess which is responsible for creating / reaping others; spawn this once and use some IPC (pipes etc) to tell it what to do.
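A minimal sketch of that second option, assuming a hypothetical worker executable that reads one task id per line on stdin and answers one result line per task on stdout; spawn it once and reuse the pipes instead of forking per object:
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;

public class LongLivedWorker {
    public static void main(String[] args) throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder("/usr/local/bin/worker"); // hypothetical path
        pb.redirectErrorStream(true);
        Process worker = pb.start();

        BufferedWriter toWorker =
            new BufferedWriter(new OutputStreamWriter(worker.getOutputStream()));
        BufferedReader fromWorker =
            new BufferedReader(new InputStreamReader(worker.getInputStream()));

        for (String taskId : new String[] {"obj-1", "obj-2", "obj-3"}) {
            toWorker.write(taskId);
            toWorker.newLine();
            toWorker.flush();                       // hand the task to the worker
            String result = fromWorker.readLine();  // read its result back
            System.out.println(taskId + " -> " + result);
        }
        toWorker.close();                           // EOF tells the worker to exit
        worker.waitFor();
    }
}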
NB: This is all speculation; you should probably do some experiments to see whether this is the case.

Most likely you are running out of a resource. Are your disks getting busier as you create these processes? Do you ensure you have fewer processes than you have cores (to minimise context switches)? Is your load average below 24?
If your CPU consumption is dropping you are likely to be hitting IO (disk/network) contention i.e. the processes cannot get/write data fast enough to keep them busy. If you have 24 cores, how many disks do you have?
I would suggest you have one process per CPU (in your case I imagine 4) and give each JVM six tasks to run concurrently, so you use all the cores without overloading the system.

You would be much better off using a set of long lived processes pulling your data off of queues and sending the results back, rather than constantly forking new processes for each event, especially from the host JVM with that enormous heap.
Forking a 240GB image is not free; it consumes a large amount of virtual resources, even if only for a second. The OS doesn't know how long the new process will be alive, so it must prepare itself as if the entire process will be long lived; thus it sets up a virtual clone of all 240GB before obliterating it with the exec call.
If instead you had a long lived process that you could send objects to via some queue mechanism (and there are many, for both Java and C, etc.), that would relieve you of some of the pressure of the forking process.
I don't know how you are transferring the data from the JVM to the external program. But if your external program can work with stdin/stdout, then (assuming you're using Unix) you could leverage inetd. Here you make a simple entry in the inetd configuration file for your process and assign it a port. Then you open up a socket, pour the data down into it, then read back from the socket (a sketch of the client side follows the wrapper below). Inetd handles the networking details for you, and your program simply works with stdin and stdout. Mind you, you'll have an open socket on the network, which may or may not be secure in your deployment. But it's pretty trivial to set up even legacy code to run via a network service.
You could use a simple wrapper like this:
#!/bin/sh
# stage stdin in a temp file, run the executable, stream the result back
infile=/tmp/$$.in
outfile=/tmp/$$.out
cat > "$infile"
/usr/local/bin/process -input "$infile" -output "$outfile"
cat "$outfile"
rm -f "$infile" "$outfile"
It's not the highest-performing server on the planet, nor is it designed for zillions of transactions, but it's sure a lot faster than forking 240GB over and over and over.
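And a minimal sketch of the JVM side, assuming the wrapper above has been registered in inetd.conf on port 9000 (the port is an arbitrary choice): write the input down the socket, close the output half so the wrapper's cat sees EOF, then read the result back.
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.net.Socket;

public class InetdClient {
    public static void main(String[] args) throws IOException {
        Socket socket = new Socket("localhost", 9000);  // port chosen in inetd.conf (assumption)
        try {
            Writer out = new OutputStreamWriter(socket.getOutputStream());
            out.write("object data goes here\n");
            out.flush();
            socket.shutdownOutput();                    // signals EOF so `cat > "$infile"` returns

            BufferedReader in =
                new BufferedReader(new InputStreamReader(socket.getInputStream()));
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);               // result produced by the executable
            }
        } finally {
            socket.close();
        }
    }
}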

I mostly agree with Peter. You are most probably suffering from IO bottlenecks. Once you have many processes, the OS has to work harder even for trivial tasks, and the performance penalty compounds.
So the 'solution' could be to create 'consumer' processes and only initialise a certain few; as Peter suggested, one per CPU or so. Then use some form of IPC to 'transfer' these objects to the consumer processes.
Your 'consumer' processes should manage creation of the sub-processes (the processing executable, which I presume you don't have any access to). This way you don't clutter the OS with too many processes, and the 'job' will eventually complete.
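A simplified sketch of the same idea done with consumer threads inside the JVM rather than separate consumer processes, assuming the executable accepts a batch-list file as in the batched version of the question; the fixed pool caps how many external processes exist at any moment:
import java.io.IOException;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class BoundedConsumers {
    private static final int CONSUMERS = 6; // assumption: tune to cores / IO capacity

    public static void runBatches(List<String> batchFiles) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(CONSUMERS);
        for (final String batchFile : batchFiles) {
            pool.submit(new Runnable() {
                public void run() {
                    try {
                        Process p = new ProcessBuilder("/usr/local/bin/process", batchFile)
                                .redirectErrorStream(true)
                                .start();
                        p.waitFor();        // at most CONSUMERS processes run at once
                    } catch (IOException e) {
                        e.printStackTrace();
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(7, TimeUnit.DAYS);  // wait for all batches to finish
    }
}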

Related

Does downloading with multiple threads actually speed things up?

So, I was starting up Minecraft a few days ago and opened up its developer console to see what it was doing while it was updating itself. I noticed one of the lines said the following:
Downloading 32 files. (16 threads)
Now, the first thing that came to mind was: the processor can still only do one thing at a time; all threads do is split up their tasks and distribute the CPU time between them. So what would be the purpose of downloading multiple files on multiple threads if each thread is still only being run on a single processor?
Then, in the process of deciding whether or not I should ask this question on SO, I remembered that multiple cores can reside on one processor. For example, my processor is quad-core. So you can actually accomplish 4 downloads truly simultaneously. Now that sounds like it makes sense. Except for the fact that 16 threads are being used for Minecraft's download. So, basically my question is:
Does increasing the number of threads during a download help the speed at all? (Assuming a multi-core processor, and the thread count is less than the core count.)
And
If you increase the number of threads to past the number of cores, does speed still increase? (It sounds to me like the downloads would be max-speed after 4 threads, on a quad-core processor.)
Downloads are network-bound, not CPU-bound. So theoretically, using multiple threads will not make it faster.
On the one hand, if your program downloads using synchronous (blocking) I/O, then multiple threads are what allow several transfers to be in flight at once rather than each one blocking the next. In general, on the other hand, it is more sensible to just use a single thread with asynchronous I/O.
On the gripping hand, asynchronous I/O is trickier to code correctly than synchronous I/O (which is straightforward). So the developers may have just decided to favour ease of programming over pure performance. (Or they may favour compatibility with older Java platforms: real async I/O is only available with NIO2 (which came with Java 7).)
When one thread downloads one file, it will spend some time waiting. When one thread downloads N files, one after another, it will spend, on average, N times as much total wait time.
When N threads each download one file, each of those threads will spend some time waiting, but some of those waits will be overlapped (e.g., thread A and thread B are both waiting at the same time.) The end result is that it may take less wall-clock time to get all N of the files.
On the other hand, if the threads are waiting for files from the same server, each thread's individual wait time may be longer.
The question of whether there is an overall performance benefit depends on the client, on the server, and on the available network bandwidth. If the network can't carry bytes as fast as the server can pump them out, then multi-threading the client probably won't save any time. If the server is single-threaded, then multi-threading the client definitely won't help. But if the conditions are right (e.g., you have a fast internet connection, and especially if the files are coming from a server farm instead of a single machine), then multi-threading can potentially speed things up.
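A minimal sketch of overlapping those waits with a small fixed thread pool; the URLs, file names, and pool size are placeholders:
import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URL;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ParallelDownloader {
    public static void main(String[] args) {
        String[] urls = {"http://example.com/a.bin", "http://example.com/b.bin"};
        ExecutorService pool = Executors.newFixedThreadPool(4);
        for (final String u : urls) {
            pool.submit(new Runnable() {
                public void run() {
                    try {
                        InputStream in = new URL(u).openStream();
                        OutputStream out = new FileOutputStream(u.substring(u.lastIndexOf('/') + 1));
                        byte[] buf = new byte[8192];
                        int n;
                        while ((n = in.read(buf)) != -1) {
                            out.write(buf, 0, n);   // each thread blocks on its own connection
                        }
                        out.close();
                        in.close();
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                }
            });
        }
        pool.shutdown();
    }
}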
Normally it will not be faster, but there are always exceptions.
Assuming you are opening a new connection for each download thread, then if:
The network (either your own network or the target system) is limiting the download speed per connection, or
You are downloading from multiple servers, etc.
Or if the "download" is not a plain download, but involves downloading something and doing some CPU-intensive processing on it.
In such cases you may see the download speed improve with multiple threads.

How to handle thousands of threads in Java without using the new java.util.concurrent package

I have a situation in which I need to create thousands of instances of a class from a third-party API. Each new instance creates a new thread. I start getting OutOfMemoryError once there are more than 1000 threads. But my application requires creating 30,000 instances. Each instance is active all the time. The application is deployed on a 64-bit Linux box with 8GB RAM, and only 2GB is available to my application.
The way the third party library works, I cannot use the new Executor framework or thread pooling.
So how can I solve this problem?
Note that using thread pool is not an option. All threads are running all the time to capture events.
Memory size on the Linux box is not in my control, but if I had the choice to have 25GB available to my application on a 32GB system, would that solve my problem, or would the JVM still choke?
Are there some optimal Java settings for the above scenario?
The system uses Oracle Java 1.6 64 bit.
I concur with Ryan's Answer. But the problem is worse than his analysis suggests.
Hotspot JVMs have a hard-wired minimum stack size - 128k for Java 6 and 160k for Java 7.
That means that even if you set the stack size to the smallest possible value, you'd need to use roughly twice your allocated space ... just for thread stacks.
In addition, having 30k native threads is liable to cause problems on some operating systems.
I put it to you that your task is impossible. You need to find an alternative design that does not require you to have 30k threads simultaneously. Alternatively, you need a much larger machine to run the application.
Reference: http://mail.openjdk.java.net/pipermail/hotspot-runtime-dev/2012-June/003867.html
I'd say give up now and figure another way to do it. Default stack size is 512K. At 30k threads, that's 15G in stack space alone. To fit into 2G, you'll need to cut it down to less than 64K stacks, and that leaves you with zero memory for the heap, including all the Thread objects, or the JVM itself.
And that's just the most obvious problem you're likely to run into when running that many simultaneous threads in one JVM.
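For what it's worth, a minimal sketch of requesting a smaller per-thread stack via the Thread(ThreadGroup, Runnable, String, long) constructor; the value is only a hint, HotSpot still enforces its hard-wired minimum (128k on Java 6), and the -Xss launcher option makes the same request globally:
public class SmallStackThreads {
    public static void main(String[] args) {
        Runnable task = new Runnable() {
            public void run() {
                // event-capturing work would go here
            }
        };
        for (int i = 0; i < 100; i++) {
            // request a 128k stack for each thread; the JVM may round this up or ignore it
            Thread t = new Thread(null, task, "worker-" + i, 128 * 1024L);
            t.start();
        }
    }
}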
I think we are missing lots of details, but would a distributed platform work? Each individual instance would manage a range of your class instances. Those platforms could be running on different PCs or virtual machines and communicate with each other.
I had the same problem with an SNMP provider that required a thread for each outstanding get (I wanted to have tens of thousands of outstanding gets going on at once). Now that NIO exists I'd just rewrite the library myself if I had to do this again.
You cannot solve it in "Java code" or configuration. Windows chokes at around 2,000-3,000 threads in my experience (this may have changed in later versions). When I was doing this I was surprised to find that Linux supported even fewer threads (around 1,000).
When the system stops supplying threads, "Out of Memory" is the exception you should expect to see, so I'm sure that's it; I started getting this exception long before I actually ran out of memory. Perhaps you could hack Linux somehow to support more, but I have no idea how.
Using the concurrent package will not help here. If you could switch over to "Green" threads it might, but that might take recompiling the JVM (it would be nice if it was available as a command line switch, but I really don't think it is).

Throttling CPU from within Java

I have seen many questions in this (and others) forum with the same title, but none of them seemed to address exactly my problem. This is it:
I have got a JVM that eats all the CPU on the machine that hosts it. I would like to throttle it; however, I cannot rely on any throttling tool/technique external to Java, as I cannot make assumptions about where this VM will be run. Thus, for instance, I cannot use processor affinity, because if the VM runs on a Mac the OS won't make processor affinity available.
What I would need is an indication as to whether means exist within Java to ensure the thread does not take the full CPU.
I would like to point out straightaway that I cannot use techniques based on alternating process execution and pauses, as suggested in some forums, because the thread needs to generate values continuously.
Ideally I'd like some means of, for instance, setting some VM or thread priority, or capping in some way the percentage of CPU consumed.
Any help would be much appreciated.
What I would need is an indication as to whether means exist within Java to ensure the thread does not take the full CPU.
There is no way that I know of to do this within Java except for tuning your application to use less CPU.
You could put some Thread.sleep(...); calls in your calculation methods. A profiler would help with showing you the hot loops/methods/etc..
Spawning fewer threads would also reduce the CPU used; consider moving to fixed-size thread pools or lowering the number of threads in your pools.
It may not be CPU that is the problem but other resources. Watch your IO bandwidth for example. Slowing down your network or disk reads/writes might restore your server to proper operation.
From outside of the JVM you could use the Unix nice command to lower the priority of the running JVM so it does not dominate the system. This will give it CPU if available but will let other applications get more of the CPU.
I take it you want something more reliable than setting the threads' priorities?
If you want throttled execution of some code that is constantly generating values, you need to look into chunking up the work the thread(s) do, and coding in your own timer. For example, the java.util.Timer allows for scheduling execution at a fixed rate.
Any other technique will still consume as much CPU as is available (1 core per thread, assuming no locks preventing concurrent execution) when the scheduler doesn't have other tasks to prioritize ahead of yours.
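A minimal sketch of that chunked approach using java.util.Timer's fixed-rate scheduling; the 50 ms period is an arbitrary placeholder:
import java.util.Timer;
import java.util.TimerTask;

public class ThrottledGenerator {
    public static void main(String[] args) {
        Timer timer = new Timer("generator", false);
        timer.scheduleAtFixedRate(new TimerTask() {
            public void run() {
                // produce the next chunk of values here; the CPU is idle between ticks
            }
        }, 0L, 50L);
    }
}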
The detail is simply that you said "must generate values continuously", and if that, to the extreme, is true, then CPU saturation is actually the goal.
But, if you define "continuously" as X values per second, then there is room to work.
Because then you can run your process at 100% CPU, measure the number of values over time, and if you find that it generates more values than necessary (more than X/sec), you can insert pauses into the process as appropriate until the value rate reaches your desired goal.
The plan being to continually monitor and adjust the pauses to maintain your value rate over time. Then your process will take as much CPU as necessary to meet your values/sec goal.
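A minimal sketch of that plan, assuming a hypothetical target rate of 1,000 values per second: generate a value, compare the measured rate against the target, and sleep only when running ahead of schedule.
public class RateLimitedLoop {
    private static final double TARGET_PER_SEC = 1000.0; // placeholder requirement

    public static void main(String[] args) throws InterruptedException {
        long produced = 0;
        long start = System.currentTimeMillis();
        while (true) {
            generateValue();
            produced++;
            long elapsedMs = System.currentTimeMillis() - start;
            double expectedMs = produced * 1000.0 / TARGET_PER_SEC;
            if (expectedMs > elapsedMs) {
                Thread.sleep((long) (expectedMs - elapsedMs)); // ahead of schedule: yield the CPU
            }
        }
    }

    private static void generateValue() {
        // the continuous computation goes here
    }
}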
Addenda:
If you have a benchmark of values/sec that you are happy with, then interjecting the sleeps will give "all the priority necessary" to the other applications, but still maintain your throughput. If, on the other hand, you don't have any solid requirement (that is, the requirement is "run as fast as possible when nothing else is running, with no actual requirement for ANY results if some other process dominates the CPU"), then that's truly a kernel issue of the host OS, and not something the JVM has any direct, portable mechanism to address.
On Unix systems, you have the nice(1) command to adjust process (not thread) priority, and Windows has its own mechanism. With these commands, you can knock the priority of your Java process down to just above "idle" (the default "process" that always runs when nothing else is running). But it's platform specific, as this is an inherently platform-specific problem. This may well be managed through platform-specific startup scripts that launch your Java program (or even a Java launcher that detects the platform and "does the right thing" before executing your actual code).
Most systems will allow you to lower your own process priorities, but few will let you raise unless you're an admin/superuser or have whatever the appropriate role is for your host OS.
Check to see if you have any "tight loops" in your code.
while (true) {
    if (object.checkSomething()) {
        ...
    }
}
If you do, then you are burning the CPU cycles on millions of checks that are probably not that time critical. The JVM will oblige (because it doesn't know if the check is "important" or not) and you'll get 100% CPU.
If you find such loops, rewrite them like so
while (true) {
    if (object.checkSomething()) {
        ...
    }
    try {
        Thread.sleep(100);
    } catch (InterruptedException e) {
        // purposefully do nothing
    }
}
and the sleeping will voluntarily release the CPU within the loop, preventing it from running too quickly (and checking the condition too many times).
Really interesting thread. I found out Java does not provide means for doing what I want to do, and the only way to do this is from outside the JVM.
I ended up using nice to alter the scheduling priority in my test (Linux) environment, and will still need to find something similar for Windows-based OSs.
Everyone's intervention has been much appreciated.

How can I get the cpu usage a jvm process consumes for each cpu core using java?

Actually I'm using Java to monitor the CPU usage of a certain Java process. Here are my questions:
First, is there a limit such that a single process can only consume CPU time on one or a limited number of CPU cores? Or can it use CPU time on each of the cores?
Second, if I want to monitor the CPU usage of a certain Java process for each CPU core, how can I do that?
And I prefer to handle it in pure Java, not native methods.
To the operating system, a single thread (which I assume is what you mean by "Java process") essentially cannot use CPU on more than one "processor" (which may or may not mean a physical core-- see below) simultaneously.
Generally, whenever a given thread gets a "turn at running", Windows (and I assume other operating systems) will attempt to schedule a given thread on to the same "processor" that it last ran on.
However, the situation is complicated by hyperthreading CPUs, which present to the operating system several "processors" for what is actually a single physical core. In this case, it is the CPU itself that decides which instruction of which thread is running on which component of the given core at any one time. (For example, the core's arithmetic unit could be performing an arithmetic instruction for Thread 1 while the load/store unit is fetching data from memory for an instruction for Thread 2.)
So given the complexity of this situation, even if you can get per-core measurements, I'm not entirely sure quite what useful meaning you would attach to them.
P.S. If you'll permit the plug, I don't know if this Java-focussed article on thread scheduling that I wrote a couple of years ago might be useful. I should say I wrote it before either Windows 7 or the latest Intel Core CPUs were released, and there may be one or two updates to the information that would be pertinent (in particular, I don't address the issue of variable core speeds and how that could affect scheduling).
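If whole-process usage (rather than per-core) is enough, here is a rough sketch using the com.sun.management extension of OperatingSystemMXBean that Sun/Oracle JVMs expose; sampling getProcessCpuTime() twice gives the CPU consumed by the JVM across all cores combined:
import java.lang.management.ManagementFactory;

public class ProcessCpuSampler {
    public static void main(String[] args) throws InterruptedException {
        // the cast succeeds on Sun/Oracle JVMs, where the platform bean implements the extension
        com.sun.management.OperatingSystemMXBean os =
            (com.sun.management.OperatingSystemMXBean) ManagementFactory.getOperatingSystemMXBean();
        long cpuBefore = os.getProcessCpuTime();   // cumulative CPU nanoseconds used so far
        Thread.sleep(1000);
        long cpuAfter = os.getProcessCpuTime();
        double usedCores = (cpuAfter - cpuBefore) / 1e9; // CPU-seconds per wall-clock second
        System.out.printf("JVM used the equivalent of %.2f cores%n", usedCores);
    }
}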

Java Performance Processes vs Threads

I am implementing a worker pool in Java.
This is essentially a whole load of objects which will pick up chunks of data, process the data and then store the result. Because of IO latency there will be significantly more workers than processor cores.
The server is dedicated to this task and I want to wring the maximum performance out of the hardware (but no I don't want to implement it in C++).
The simplest implementation would be to have a single Java process which creates and monitors a number of worker threads. An alternative would be to run a Java process for each worker.
Assuming for arguments sake a quadcore Linux server which of these solutions would you anticipate being more performant and why?
You can assume the workers never need to communicate with one another.
One process, multiple threads - for a few reasons.
When context-switching between jobs, it's cheaper on some processors to switch between threads than between processes. This is especially important in this kind of I/O-bound case with more workers than cores. The more work you do between getting I/O blocked, the less important this is. Good buffering will pay off whether you use threads or processes, though.
When switching between threads in the same JVM, at least some Linux implementations (x86, in particular) don't need to flush cache. See Tsuna's blog. Cache pollution between threads will be minimized, since they can share the program cache, are performing the same task, and are sharing the same copy of the code. We're talking savings on the order of 100's of nanoseconds to several microseconds per switch. If that's small potatoes for you, then read on...
Depending on the design, the I/O data path may be shorter for one process.
The startup and warmup time for a thread is generally much shorter. The OS doesn't have to start a process, Java doesn't have to start another JVM, classloading is only done once, JIT-compilation is only done once, and HotSpot optimizations are done once, and sooner.
Well usually, when discussing multiprocessing (with one thread per process) versus multithreading in the same process, the theoretical overhead is bigger in the first case than in the latter (and thus multiprocessing is theoretically slower than multithreading), but in reality on most modern OSs this is not such a big issue. However, when discussing it in the Java context, starting a new process is a lot more costly than starting a new thread. Starting a new process means starting up a new instance of the JVM, which is very costly, especially in terms of memory. I recommend that you start multiple threads in the same JVM.
Moreover, if you say inter-thread communication is not an issue, you can use Java's ExecutorService to get a fixed thread pool of size 2x (number of available CPUs). The number of available CPUs can be autodetected at runtime via Java's Runtime class. This way you get quick, simple multithreading going without any boilerplate code.
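A minimal sketch of that suggestion; the task body and the number of queued work items are placeholders:
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class WorkerPool {
    public static void main(String[] args) {
        int cpus = Runtime.getRuntime().availableProcessors();            // autodetect the core count
        ExecutorService workers = Executors.newFixedThreadPool(2 * cpus); // 2x CPUs for IO-bound work
        for (int i = 0; i < 100; i++) {                                    // queue up the work items
            workers.submit(new Runnable() {
                public void run() {
                    // fetch a chunk of data, process it, store the result
                }
            });
        }
        workers.shutdown();
    }
}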
Actually, if you do this with large-scale tasks, using multiple JVM processes is way faster than one JVM with multiple threads. At least we never got one JVM running as fast as multiple JVMs.
We do some calculations where each task uses around 2-3GB of RAM and does some heavy number crunching. If we spawn 30 JVMs and run 30 tasks, they perform around 15-20% better than spawning 30 threads in one JVM. We tried tuning the GC and the various memory sections and never caught up to the first variant.
We did this on various machines: 14 tasks on a 16-core server, 34 tasks on a 36-core server, etc. Multithreading in Java always performed worse than multiple JVM processes.
It may not make any difference for simple tasks, but for heavy calculations the JVM seems to perform badly with threads.
