I'm trying to debug a file descriptor leak in a Java webapp running in Jetty 7.0.1 on Linux.
The app had been happily running for a month or so when requests started to fail due to too many open files, and Jetty had to be restarted.
java.io.IOException: Cannot run program [external program]: java.io.IOException: error=24, Too many open files
at java.lang.ProcessBuilder.start(ProcessBuilder.java:459)
at java.lang.Runtime.exec(Runtime.java:593)
at org.apache.commons.exec.launcher.Java13CommandLauncher.exec(Java13CommandLauncher.java:58)
at org.apache.commons.exec.DefaultExecutor.launch(DefaultExecutor.java:246)
At first I thought the issue was with the code that launches the external program, but it's using commons-exec and I don't see anything wrong with it:
CommandLine command = new CommandLine("/path/to/command")
        .addArgument("...");
ByteArrayOutputStream errorBuffer = new ByteArrayOutputStream();
Executor executor = new DefaultExecutor();
executor.setWatchdog(new ExecuteWatchdog(PROCESS_TIMEOUT));
executor.setStreamHandler(new PumpStreamHandler(null, errorBuffer));
try {
    executor.execute(command);
} catch (ExecuteException executeException) {
    if (executeException.getExitValue() == EXIT_CODE_TIMEOUT) {
        throw new MyCommandException("timeout");
    } else {
        throw new MyCommandException(errorBuffer.toString("UTF-8"));
    }
}
Listing open files on the server I can see a high number of FIFOs:
# lsof -u jetty
...
java 524 jetty 218w FIFO 0,6 0t0 19404236 pipe
java 524 jetty 219r FIFO 0,6 0t0 19404008 pipe
java 524 jetty 220r FIFO 0,6 0t0 19404237 pipe
java 524 jetty 222r FIFO 0,6 0t0 19404238 pipe
When Jetty starts there are just 10 FIFOs; after a few days there are hundreds of them.
I know it's a bit vague at this stage, but do you have any suggestions on where to look next, or how to get more detailed info about those file descriptors?
The problem comes from your Java application (or a library you are using).
First, you should read both output streams in their entirety (Google for StreamGobbler), and promptly!
Javadoc says:
The parent process uses these streams to feed input to and get output from the subprocess. Because some native platforms only provide limited buffer size for standard input and output streams, failure to promptly write the input stream or read the output stream of the subprocess may cause the subprocess to block, and even deadlock.
Secondly, waitFor() your process to terminate.
You then should close the input, output and error streams.
Finally destroy() your Process.
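Put together, a rough sketch of that lifecycle for a raw Runtime.exec() Process (the command path is a placeholder; commons-exec's PumpStreamHandler does the stream pumping for you, but the waitFor/close/destroy steps still apply):
import java.io.InputStream;

public class ProcessLifecycle {
    public static void main(String[] args) throws Exception {
        // "/path/to/command" is a placeholder for the external program.
        final Process process = Runtime.getRuntime().exec("/path/to/command");

        // 1) Drain stdout and stderr on separate threads so the child never
        //    blocks on a full pipe buffer (the "StreamGobbler" pattern).
        Thread outGobbler = gobble(process.getInputStream());
        Thread errGobbler = gobble(process.getErrorStream());

        try {
            int exitCode = process.waitFor();   // 2) wait for the child to terminate
            outGobbler.join();
            errGobbler.join();
            System.out.println("exit code: " + exitCode);
        } finally {
            process.getOutputStream().close();  // 3) close stdin, stdout and stderr
            process.getInputStream().close();   //    so the pipe descriptors are released
            process.getErrorStream().close();
            process.destroy();                  // 4) destroy the process for good measure
        }
    }

    // Read an output stream completely on its own thread.
    private static Thread gobble(final InputStream in) {
        Thread t = new Thread(new Runnable() {
            public void run() {
                byte[] buffer = new byte[4096];
                try {
                    while (in.read(buffer) != -1) {
                        // discard (or log) the data; the point is to keep reading
                    }
                } catch (Exception ignored) {
                    // stream closed; nothing to do
                }
            }
        });
        t.start();
        return t;
    }
}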
My sources:
http://stuffthathappens.com/blog/2007/11/28/crash-boom-too-many-open-files/
http://www.javaworld.com/javaworld/jw-12-2000/jw-1229-traps.html?page=4
http://kylecartmell.com/?p=9
As you are running on Linux I suspect you are running out of file descriptors. Check out ulimit. Here is an article that describes the problem: http://www.cyberciti.biz/faq/linux-increase-the-maximum-number-of-open-files/
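If you'd rather watch the descriptor count from inside the JVM than with lsof, HotSpot/OpenJDK exposes it through a com.sun.management extension; a small sketch (works only on Unix-style JVMs that provide this MXBean):
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;
import com.sun.management.UnixOperatingSystemMXBean;

public class FdCount {
    public static void main(String[] args) {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        if (os instanceof UnixOperatingSystemMXBean) {
            UnixOperatingSystemMXBean unixOs = (UnixOperatingSystemMXBean) os;
            // Current number of open file descriptors vs. the per-process limit.
            System.out.println("Open fds: " + unixOs.getOpenFileDescriptorCount());
            System.out.println("Max fds:  " + unixOs.getMaxFileDescriptorCount());
        } else {
            System.out.println("Not a Unix-style JVM; counts unavailable.");
        }
    }
}
Logging these values periodically makes it easy to see whether the descriptor count climbs steadily toward the limit.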
Aside from looking into root-cause issues like file leaks, in order to legitimately increase the "open files" limit and have it persist across reboots, consider editing
/etc/security/limits.conf
by adding something like this
jetty soft nofile 2048
jetty hard nofile 4096
where "jetty" is the username in this case. For more details on limits.conf, see http://linux.die.net/man/5/limits.conf
Log off and then log in again and run
ulimit -n
to verify that the change has taken place. New processes by this user should now comply with this change. This link seems to describe how to apply the limit on already running processes but I have not tried it.
The default limit of 1024 can be too low for large Java applications.
Don't know the nature of your app, but I have seen this error manifested multiple times because of a connection pool leak, so that would be worth checking out. On Linux, socket connections consume file descriptors as well as file system files. Just a thought.
You can handle the fds yourself. exec in Java returns a Process object. Intermittently check whether the process is still running; once it has completed, close the process's STDERR, STDIN, and STDOUT streams (e.g. proc.getErrorStream().close()). That will mitigate the leaks.
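A minimal sketch of that polling approach (the command path is a placeholder):
// Check the child periodically and close its streams once it has exited.
// exitValue() throws IllegalThreadStateException while the process is still running.
static void runAndCleanUp() throws Exception {
    Process proc = Runtime.getRuntime().exec("/path/to/command");
    while (true) {
        try {
            proc.exitValue();               // throws while the child is still running
            break;
        } catch (IllegalThreadStateException stillRunning) {
            Thread.sleep(500);              // poll again shortly
        }
    }
    proc.getOutputStream().close();         // child's STDIN
    proc.getInputStream().close();          // child's STDOUT
    proc.getErrorStream().close();          // child's STDERR
    proc.destroy();
}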
This problem occurs when you are writing data to many files simultaneously and your operating system has a fixed limit on open files. On Linux, you can increase the limit of open files.
https://www.tecmint.com/increase-set-open-file-limits-in-linux/
I have a Unix system mounting an NFS "share" from a Windows server. On the Windows server I have a PowerShell script that checks every 10 s whether a new file has come in on the NFS share and uses Move-Item to move it somewhere else, where it then gets processed further.
What we are seeing is that files are corrupted in this process. My hunch is that the NFS write takes a little longer, the script picks up an incomplete file, and Move-Item moves it to the other folder. A colleague also has a theory that the further processing picks up the file before Move-Item has completed. I do not believe that theory, because Move-Item on the same file system should be an atomic, metadata-only operation. (Don't be confused by the NFS reference: the Windows server has these files locally, the NFS share is mounted by the Unix system, so Move-Item does not involve NFS and, in my case, doesn't cross file system boundaries either.)
Either way, I want to know why the writing of the file to NFS, which is done by a Java process on Unix, is not locking the file on the Windows host file system. Would I have to explicitly set an NFS lock from Java somehow? Is there even support for the fcntl lock feature from Java?
Also, if I used the PowerShell Copy command rather than Move-Item, there would be a moment when the file is incompletely copied. Doesn't the Copy command automatically set a lock on the destination file until it is finished?
EDIT: This is actually getting more and more puzzling. First I tried locking the file explicitly while writing to the NFS share. This is Java, and it creates a huge problem with NFS: I couldn't get the nlockmgr service to actually work. There is a firewall between the two; I opened all the right passages, yet I get no response to the lock requests from the Windows NFS server. This causes the Java side to hang completely, so badly that you can't even kill -KILL the JVM. The only way to end this nightmare is to reboot the Unix system, crazy! There also isn't a timeout on the lock request, which is a big problem in Java; I have seen similar issues elsewhere, e.g. you cannot kill a thread that hangs reading from a socket. Whatever, there is no way to cancel a lock request, so I gave up on that.
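For reference, the explicit lock attempt described above would look roughly like this from Java; FileChannel locks map to fcntl/NLM locks on most Unix NFS clients (the path is made up):
import java.io.RandomAccessFile;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;

public class NfsLockSketch {
    public static void main(String[] args) throws Exception {
        // "/mnt/nfs/incoming.dat" is a hypothetical file on the NFS mount.
        try (RandomAccessFile raf = new RandomAccessFile("/mnt/nfs/incoming.dat", "rw");
             FileChannel channel = raf.getChannel()) {
            // lock() blocks until the lock is granted; on NFS this goes through the
            // NLM/nlockmgr protocol, which is where the hang described above occurred.
            FileLock lock = channel.lock();
            try {
                // ... write the file while holding the lock ...
            } finally {
                lock.release();
            }
        }
    }
}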
Then I added a filter in the PowerShell script to only move files whose last-write time is at least 10 seconds older than the current time. That should leave more than enough time for the writer to finish. But apparently it doesn't help either.
UPDATE: But yes, I have now watched it: the copy process on Unix from S3 to NFS to Windows NTFS takes a long time, and it is all running on AWS, so even S3 should be considered fast. Yet it crawls along at 0 kB ... 64 kB ... 90 kB, and 10 seconds is not enough to wait between each new chunk written. I increased this wait time to 30 seconds and that seems to work, but it is not guaranteed.
Locking would be the right solution, but I have 2 major obstacles:
I can't get the Windows NFS "share", mounted on Unix, to work with the nlockmgr service.
The JVM will stall completely, unkillably, if nlockmgr has a problem.
I need to run a program after a Linux EC2 machine is provisioned on AWS. The following code gets a "Too many open files" error. my_program opens a lot of files, maybe around 5000.
String cmd = "my_program";
Process process = new ProcessBuilder()
        .inheritIO()
        .command(cmd)
        .start();
However, running my_program in the console finishes without any error. What is the ulimit when running the program using ProcessBuilder()...start()?
ulimit -n outputs 65535 in a bash terminal.
First find out the limits your app has when running:
ps -ef | grep <<YOUR-APP-NAME>>
then:
cat /proc/<<PID-of-your-APP>>/limits
The problem here is likely that your app starts under user X or Y, and those users have a different ulimit setup.
Check:
cat /etc/security/limits.conf
and increase those values.
Just my 2 cents...
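If it helps, you could also have the parent print the limits the spawned child actually received by reading its /proc/<pid>/limits; a rough sketch (program name as in the question; Process.pid() requires Java 9+):
import java.io.BufferedReader;
import java.io.FileReader;

public class ChildLimits {
    public static void main(String[] args) throws Exception {
        // Spawn the program and dump the limits the child actually got.
        Process process = new ProcessBuilder("my_program").inheritIO().start();
        long pid = process.pid();
        try (BufferedReader reader = new BufferedReader(
                new FileReader("/proc/" + pid + "/limits"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                // Print the header row and the "Max open files" row.
                if (line.startsWith("Limit") || line.contains("open files")) {
                    System.out.println(line);
                }
            }
        }
        process.waitFor();
    }
}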
You need to ensure that you close() the files after use. They will eventually be closed by the garbage collector (I'm not completely sure about this, as it can differ between implementations), but if you process many files without closing them, you can run out of file descriptors before the garbage collector has had any chance to run.
Another thing you can do is use a try-with-resources statement, which ensures that every resource you declare in the parenthesized group is a Closeable that will be forcibly close()d on exit from the try.
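For example (the file name is made up):
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class TryWithResources {
    public static void main(String[] args) throws IOException {
        // The reader is closed automatically when the try block exits, whether
        // normally or via an exception, so the file descriptor is released
        // immediately rather than whenever the GC runs.
        try (BufferedReader reader = new BufferedReader(new FileReader("data.txt"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}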
Anyway, if you want to raise the maximum number of open files per process, look at your shell's man page (most probably bash(1)) and search for the ulimit command (there is no separate ulimit manual page, as it is a command internal to the shell; ulimit values are per process, so you cannot start another process to change your process's maximum limits).
Beware that Linux distributions (well, the most common ones) normally don't have a way to configure an enforced per-user value for this (there is a full set of options for things like this on BSD systems, but Linux's login(8) program has not implemented the /etc/login.conf feature), and raising this value arbitrarily can be a security problem if your system runs as a multiuser system.
Usually, I use jstack to check whether a Java process is working normally. However, I found that when the /tmp/.java_pid<num> socket file (where num is the pid of the Java process) has been deleted, jstack no longer works, like this:
[xxx]$ jstack -l 5509
5509: Unable to open socket file: target process not responding or HotSpot VM not loaded
The -F option can be used when the target process is not responding
(P.S. I didn't want to use "-F"; it may cause other problems.)
Is there any way to change the socket file location (somewhere other than /tmp)? Or to regenerate the socket file when it is found to be missing? What I do now is restart the Java process, which is a very bad solution.
Thanks!
The /tmp/.java_pid<num> socket is used by the HotSpot Dynamic Attach mechanism. It is the way jstack and other utilities communicate with the JVM.
You cannot change the path - it is hardcoded in the JVM source code. Nor can you force the JVM to regenerate it, because the Attach Listener is initialized only once in the HotSpot lifetime.
jstack -F works in quite a different way.
In order to check whether a Java process is working fine, I suggest using remote JMX instead.
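For instance, assuming the target JVM was started with the standard com.sun.management.jmxremote options (the host, port and security settings below are made up), a liveness check could look roughly like this:
import java.lang.management.ManagementFactory;
import java.lang.management.RuntimeMXBean;
import javax.management.MBeanServerConnection;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class JmxPing {
    public static void main(String[] args) throws Exception {
        // Assumes the target JVM exposes JMX on localhost:9010 without auth/SSL.
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://localhost:9010/jmxrmi");
        JMXConnector connector = JMXConnectorFactory.connect(url);
        try {
            MBeanServerConnection connection = connector.getMBeanServerConnection();
            RuntimeMXBean runtime = ManagementFactory.newPlatformMXBeanProxy(
                    connection, ManagementFactory.RUNTIME_MXBEAN_NAME, RuntimeMXBean.class);
            // If this call succeeds, the JVM is up and responding.
            System.out.println("JVM is up, uptime ms: " + runtime.getUptime());
        } finally {
            connector.close();
        }
    }
}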
I am currently facing an issue with my Java webapp running on Jetty 7.4.5.v20110725 on Linux. My webapp, which serves static content, runs out of file descriptors a few days after it starts. I am starting the Jetty server with useFileMappedBuffer = true (in webDefaults.xml). I am using jdk1.6.0_30. Please let me know if you have any suggestions on how to fix this issue.
Please note that this issue does not occur when useFileMappedBuffer = false (in webDefaults.xml).
The next time your application has run out of file descriptors, please try to find out which files are open and whether open connections are causing a problem.
Try listing the open files by calling (I think it's lsof -p, but check the lsof manpage as I'm writing this from memory):
lsof -p <jettypid>
This will show you what files are opened by the jetty process. Look for files which probably should have been closed already, etc.
Then do a:
netstat -an
This will show you established network connections. Look if there's lots of connections in CLOSE_WAIT state or similar indicating that connections are not properly closed.
Also have a look at your system wide OS limits with:
ulimit -a
It'll show you how many file descriptors can be opened by a single process. If you've a site with high traffic and the pretty common default value of 1024 max fd, you might need to raise that. If you think that traffic is the problem have a look at this guide: http://wiki.eclipse.org/Jetty/Howto/High_Load
However, you've stated that the problem occurs only after a couple of days. That usually indicates connections, file resources, etc. that are not being closed properly.
If unsure what to do with the output of the commands above, feel free to paste them.
Independent of the problem I'd recommend you to upgrade to the latest jetty 7.x.
I am starting too many processes in Java using:
Runtime.getRuntime().exec("java -jar myJar.jar")
When it reaches around 350 processes, I get an IO exception:
Cannot run program "java": java.io.IOException: error=24, Too many open files
Exception in creating process
at java.lang.ProcessBuilder.start(ProcessBuilder.java:475)
at java.lang.Runtime.exec(Runtime.java:610)
at java.lang.Runtime.exec(Runtime.java:448)
at java.lang.Runtime.exec(Runtime.java:345)
In each process I am using one database connection.
I am running a 32-bit Ubuntu OS. But when I run:
ulimit -u
I can see that process limit is unlimited. What could be the problem?
All systems have their limits - sounds like you've hit your system's limit.
In Linux, creating a new process consumes several file descriptors (roughly comparable to Windows handles) for its pipes, in addition to anything the process itself opens. The only way around it is to allocate more via ulimit/kernel settings (I don't know the details offhand).
Have you considered starting lots of java Threads instead? They would consume a lot less system resources.
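A rough sketch of that idea, with the names made up (a fixed pool bounds both the threads and, indirectly, the concurrent database connections):
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class InProcessTasks {
    public static void main(String[] args) {
        // Instead of 350 separate JVMs, run the work as tasks in one JVM.
        ExecutorService pool = Executors.newFixedThreadPool(16);
        for (int i = 0; i < 350; i++) {
            final int taskId = i;
            pool.submit(new Runnable() {
                public void run() {
                    // ... the work that myJar.jar's main() used to do, using one
                    // pooled database connection per task ...
                    System.out.println("task " + taskId + " done");
                }
            });
        }
        pool.shutdown();   // stop accepting new tasks; submitted ones still finish
    }
}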
The problem is that you have too many files open, not too many processes in operation. To check the file limit do:
ulimit -n
It will commonly be 1024.
Check http://www.puschitz.com/TuningLinuxForOracle.shtml and search for ulimit for good instructions on changing this limit.
Running 350 JVM instances is not the normal way. Can you redesign it to run 350 "main threads" within the same JVM? That is how a servlet container works: all web applications run in the same JVM.
P.S. The ulimit man page says the way to see the max open files is ulimit -n.