Last summer, I made a Java application that would parse some PDF files and store the information they contain in a SQLite database.
Everything was fine and I kept adding new files to the database every week or so without any problems.
Now, I'm trying to improve my application's speed and I wanted to see how it would fare if I parsed all the files I have from the last two years in a new database. That's when I started getting this error: OutOfMemoryError: Java Heap Space. I didn't get it before because I was only parsing about 25 new files per week, but it seems like parsing 1000+ files one after the other is a lot more demanding.
I partially solved the problem: I made sure to close my connection after every call to the database and the error went away, but at a huge cost. Parsing the files is now unbearably slow. As for my ResultSets and Statements / PreparedStatements, I'm already closing them after every call.
I guess there's something I don't understand about when I should close my connection and when I should keep re-using the same one. I thought that since auto-commit is on, it commits after every transaction (select, update, insert, etc.) and the connection releases the extra memory it was using. I'm probably wrong since when I parse too many files, I end up getting the error I'm mentioning.
An easy solution would be to close it after every x calls, but then again I won't understand why and I'm probably going to get the same error later on. Can anyone explain when I should be closing my connections (if at all except when I'm done)? If I'm only supposed to do it when I'm done, then can someone explain how I'm supposed to avoid this error?
By the way, I didn't tag this as SQLite because I got the same error when I tried running my program on my online MySQL database.
Edit
As it has been pointed out by Deco and Mavrav, maybe the problem isn't my Connection. Maybe it's the files, so I'm going to post the code I use to call the function to parse the files one by one:
public static void visitAllDirsAndFiles(File dir){
    if (dir.isDirectory()){
        String[] children = dir.list();
        for (int i = 0; i < children.length; i++){
            visitAllDirsAndFiles(new File(dir, children[i]));
        }
    }
    else{
        try{
            // System.out.println("File: " + dir);
            BowlingFilesReader.readFile(dir, playersDatabase);
        }
        catch (Exception exc){
            System.out.println("Other exception in file: " + dir);
        }
    }
}
So if I call the method using a directory, it recursively calls the function again using the File object I just created. My method then detects that it's a file and calls BowlingFilesReader.readFile(dir, playersDatabase);
The memory should be released when the method is done I think?
Your first instinct about open ResultSets and connections was good, though they may not be the whole cause. Let's start with your database connection first.
Database
Try using a database connection pooling library, such as the Apache Commons DBCP (BasicDataSource is a good place to start): http://commons.apache.org/dbcp/
You will still need to close your database objects, but this will keep things running smoothly on the database front.
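For illustration, here is a minimal sketch of what that can look like with Commons DBCP 2 (the 1.x version linked above uses setMaxActive instead of setMaxTotal and a different package name). The JDBC URL, table and column names below are placeholders, not taken from your code:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

import org.apache.commons.dbcp2.BasicDataSource;

public class Database {
    // One pool for the whole application; getConnection() borrows a connection,
    // and closing it returns it to the pool instead of tearing it down.
    private static final BasicDataSource POOL = new BasicDataSource();
    static {
        POOL.setUrl("jdbc:sqlite:players.db"); // placeholder URL
        POOL.setMaxTotal(8);                   // cap on pooled connections
    }

    public static void insertScore(int playerId, int score) throws SQLException {
        try (Connection conn = POOL.getConnection();
             PreparedStatement ps = conn.prepareStatement(
                     "INSERT INTO scores(player_id, score) VALUES (?, ?)")) {
            ps.setInt(1, playerId);
            ps.setInt(2, score);
            ps.executeUpdate();
        }
    }
}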
JVM Memory
Increase the size of the memory you give to the JVM. You may do so by adding -Xmx and a memory amount after, such as:
-Xmx64m <- this would give the JVM 64 megs of memory to play with
-Xmx512m <- 512 megs
Be careful with your numbers, though: throwing more memory at the JVM will not fix memory leaks. You may use something like JConsole or JVisualVM (included in your JDK's bin/ folder) to observe how much memory you are using.
Threading
You may increase the speed of your operations by threading them out, assuming the operation you are performing to parse these records is threadable. But more information might be necessary to answer that question.
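For illustration, a rough sketch with an ExecutorService might look like the following. This assumes that BowlingFilesReader.readFile and the object behind playersDatabase are safe to call from multiple threads, which you would need to verify first; PlayersDatabase is just a stand-in for whatever type that field actually has:

import java.io.File;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ParallelParser {
    public static void parseAll(List<File> files, final PlayersDatabase playersDatabase)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());
        for (final File file : files) {
            pool.submit(new Runnable() {
                @Override
                public void run() {
                    try {
                        // Same call as in your code, just run on a worker thread.
                        BowlingFilesReader.readFile(file, playersDatabase);
                    } catch (Exception exc) {
                        System.out.println("Other exception in file: " + file);
                    }
                }
            });
        }
        pool.shutdown();                          // no more tasks will be submitted
        pool.awaitTermination(1, TimeUnit.HOURS); // wait for the parsing to finish
    }
}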
Hope this helps.
As happens with garbage collection, I don't think the memory would be immediately reclaimed for the subsequent processes and threads, so we can't put all our eggs in that basket. To begin with, put all the files in one directory and not in child directories of the parent. Then load the files one by one by iterating like this:
File f = null;
for (int i = 0; i < children.length; i++){
    f = new File(dir, children[i]);
    BowlingFilesReader.readFile(f, playersDatabase);
    f = null;
}
So we are invalidating the reference so that the File object is released and will be picked up in a subsequent GC. To check the limits, test by increasing the number of files: start with 100, then 200, and so on, and then we will know at what point the OutOfMemoryError is getting thrown.
Hope this helps.
Related
I'm doing some file I/O with multiple files (writing to 19 files, it so happens). After writing to them a few hundred times I get the Java IOException: Too many open files. But I actually have only a few files opened at once. What is the problem here? I can verify that the writes were successful.
On Linux and other UNIX / UNIX-like platforms, the OS places a limit on the number of open file descriptors that a process may have at any given time. In the old days, this limit used to be hardwired1, and relatively small. These days it is much larger (hundreds / thousands), and subject to a "soft" per-process configurable resource limit. (Look up the ulimit shell builtin ...)
Your Java application must be exceeding the per-process file descriptor limit.
You say that you have 19 files open, and that after a few hundred times you get an IOException saying "too many files open". Now this particular exception can ONLY happen when a new file descriptor is requested; i.e. when you are opening a file (or a pipe or a socket). You can verify this by printing the stacktrace for the IOException.
Unless your application is being run with a small resource limit (which seems unlikely), it follows that it must be repeatedly opening files / sockets / pipes, and failing to close them. Find out why that is happening and you should be able to figure out what to do about it.
FYI, the following pattern is a safe way to write to files that is guaranteed not to leak file descriptors.
Writer w = new FileWriter(...);
try {
    // write stuff to the file
} finally {
    try {
        w.close();
    } catch (IOException ex) {
        // Log error writing file and bail out.
    }
}
1 - Hardwired, as in compiled into the kernel. Changing the number of available fd slots required a recompilation ... and could result in less memory being available for other things. In the days when Unix commonly ran on 16-bit machines, these things really mattered.
UPDATE
The Java 7 way is more concise:
try (Writer w = new FileWriter(...)) {
    // write stuff to the file
} // the `w` resource is automatically closed
UPDATE 2
Apparently you can also encounter a "too many files open" while attempting to run an external program. The basic cause is as described above. However, the reason that you encounter this in exec(...) is that the JVM is attempting to create "pipe" file descriptors that will be connected to the external application's standard input / output / error.
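If you hit this with exec(...), a sketch of the cleanup side looks like the one below; the assumption is that the leak comes from never draining or closing those pipes, since the three streams hold their descriptors until they are closed and the child process is reaped. The command string is a placeholder:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

public class ExecExample {
    public static void run(String command) throws IOException, InterruptedException {
        Process p = Runtime.getRuntime().exec(command); // creates 3 pipe descriptors
        try (BufferedReader stdout = new BufferedReader(
                new InputStreamReader(p.getInputStream()))) {
            String line;
            while ((line = stdout.readLine()) != null) {
                System.out.println(line); // drain stdout so the child cannot block
            }
        } finally {
            p.getOutputStream().close(); // the child's stdin pipe
            p.getErrorStream().close();  // the child's stderr pipe
            p.waitFor();                 // let the OS reap the process
        }
    }
}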
For UNIX:
As Stephen C has suggested, changing the maximum file descriptor value to a higher value avoids this problem.
Try looking at your present file descriptor capacity:
$ ulimit -n
Then change the limit according to your requirements.
$ ulimit -n <value>
Note that this just changes the limits in the current shell and any child / descendant process. To make the change "stick" you need to put it into the relevant shell script or initialization file.
You're obviously not closing your file descriptors before opening new ones. Are you on Windows or Linux?
Although in most general cases the error quite clearly means that file handles have not been closed, I just encountered an instance with JDK7 on Linux that is, well... sufficiently convoluted to be worth explaining here.
The program opened a FileOutputStream (fos), a BufferedOutputStream (bos) and a DataOutputStream (dos). After writing to the DataOutputStream, the dos was closed and I thought everything went fine.
Internally however, the dos tried to flush the bos, which returned a Disk Full error. That exception was eaten by the DataOutputStream, and as a consequence the underlying bos was not closed, hence the fos was still open.
At a later stage that file was then renamed from (something with a .tmp) to its real name. Thereby, the Java file descriptor trackers lost track of the original .tmp, yet it was still open!
To solve this, I had to first flush the DataOutputStream myself, catch the IOException, and close the FileOutputStream myself.
I hope this helps someone.
If you're seeing this in automated tests: it's best to properly close all files between test runs.
If you're not sure which file(s) you have left open, a good place to start is the "open" calls which are throwing exceptions! 😄
If you have a file handle that should be open exactly as long as its parent object is alive, you could add a finalize method on the parent that calls close on the file handle, and call System.gc() between tests.
Recently, I had a program batch processing files. I was certainly closing each file in the loop, but the error was still there.
Later, I resolved this problem by garbage collecting eagerly every hundred files:
int index = 0;
while (moreFilesToProcess) {   // loop condition elided in the original
    try {
        // do with outputStream...
    } finally {
        out.close();
    }
    if (index++ % 100 == 0) {
        System.gc();
    }
}
I am creating temporary files in my application using java.io.File.createTempFile().
While creating file, I have called deleteOnExit() for that File object.
This code is used in many scenarios in my application. Sometimes the temp files are too large, so I have to delete them immediately after my job is completed, which I do by calling File.delete() on some of the objects.
Now the problem is that when I delete a file using the delete() method, a reference to the deleted file stays open (because it is a temp file, in my opinion). Because of this, I am facing a memory leakage issue.
(Correct me if I am wrong on my above hypothesis)
I am seeing high disk utilization in my environment: I found a discrepancy of over 30 GB between the output of the 'df' and 'du' commands ('df' looks at the stats of the filesystem itself, whereas 'du' ignores files that have been deleted but are still held open).
If I remove deleteOnExit(), I will have to take care of deleting all the files manually. Even when I do this, the descriptors remain open (I used lsof +aL1 on Linux to see the open-but-deleted files). Why is this happening?
If I remove delete(), then I will have to wait until the VM stops for the temp files to be deleted, which is a very rare case on a production server. (Huge space utilization.)
Is there any solution for removing a file from the deleteOnExit() list if I am deleting the file manually?
I suspect that your analysis is correct, and that could be seen as a bug in Java: once you call delete, it would be fair to expect the reference created by deleteOnExit to be removed.
However, we are at least warned (sort of). The Javadoc for deleteOnExit says:
Once deletion has been requested, it is not possible to cancel the request. This method should therefore be used with care.
So I guess calling delete after deleteOnExit would be considered careless.
However, it seems to me that your question implies its own solution. You say:
If I remove delete(), then I will have to wait until VM stops to get tempFiles deleted(which is a very rare case in Production Server).
If the JVM is very rarely ended, then deleteOnExit will very rarely do you any good, which suggests that the solution is to handle your own deletions, by having your application call delete when it is finished with a file, and not use deleteOnExit at all.
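A minimal sketch of that approach (the file name prefix and method name are illustrative): create the temp file without registering it for deleteOnExit, and delete it in a finally block as soon as the work is done.

import java.io.File;
import java.io.IOException;
import java.nio.file.Files;

public class TempFileUsage {
    public static void processWithTempFile() throws IOException {
        File tmp = File.createTempFile("report", ".tmp"); // note: no deleteOnExit()
        try {
            // ... write to and read from tmp here ...
        } finally {
            // Delete as soon as the job is done. Unlike File.delete(), this
            // throws an IOException with a reason if the delete fails.
            Files.deleteIfExists(tmp.toPath());
        }
    }
}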
The pointer will remain open until the application releases the resource for that file. Try
fileVar = null;
after your
fileVar.delete();
I see this problem as well. I've temporarily "fixed" it by executing the following, prior to calling delete:
FileChannel outChan = new FileOutputStream(tmpfile, true).getChannel();
outChan.truncate(newSize);
outChan.close();
This at least makes it so that the tmp files don't consume disk space, and df and du report the same stats. It does still leak file descriptors, and I presume it leaks a small amount of heap.
It's noteworthy that File.delete() returns a boolean to indicate if the delete succeeded. It's possible that it's silently failing for you, and you actually have an unclosed stream to the file, which prevents the delete. You may want to try using the following call, which will throw an IOException with diagnostics if it is unable to delete the file.
java.nio.file.Files.delete(tmpfile.toPath())
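For example, wrapping that call in a try/catch makes the reason for a failed delete visible (it surfaces as a NoSuchFileException, a DirectoryNotEmptyException, or a FileSystemException carrying the OS-level cause):

try {
    java.nio.file.Files.delete(tmpfile.toPath());
} catch (java.io.IOException e) {
    // the exception type and message tell you why the delete failed
    e.printStackTrace();
}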
If that still doesn't isolate the problem for you, I've had some luck using file-leak-detector, which keeps track of the time files are accessed via streams and grabs a stack trace at the time the stream is created. If a stream doesn't get closed, the stack trace can point you to the origin of that stream. Unfortunately, it doesn't cover all forms of file access, such as nio.
Sort of new to Java but I've been able to learn quite a bit quite quickly. There are still many methods that elude me though. I'm writing a Java program that is to run through a bunch of files and test them thoroughly to see if they are a valid level pack file. Thanks to Matt Olenik and his GobFile class (which was only meant to extract files) I was able to figure out a good strategy to get to the important parts of the level pack and their individual map files to determine various details quickly.
However, testing it on 37 files (5 of which aren't level packs), it crashes after testing 35 files. The 35th file is a valid level pack. The error it gives is:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
Iterator iter = itemList.keySet().iterator();
numJKLs = 0;
while (iter.hasNext()) {
    String nextFile = (String) iter.next();
    if (nextFile.startsWith("jkl\\")) {
        System.out.println(((ItemInfo) itemList.get(nextFile)).length);
-->     byte[] buffer = new byte[((ItemInfo) itemList.get(nextFile)).length];
        gobRaf.seek(((ItemInfo) itemList.get(nextFile)).offset);
        gobRaf.read(buffer, 0, ((ItemInfo) itemList.get(nextFile)).length);
        String[] content = new String(buffer).toLowerCase().split("\n");
        levels.add(numJKLs, new JKLFile(theFile, nextFile, content));
        gametype = levels.get(numJKLs).getFileType();
        progress.setProgText(nextFile);
        buffer = null;
        content = null;
        numJKLs++;
    }
}
Arrow shows where the error is marked. The JKLFile class reads the content array for these important parts, but should (in theory) dispose of it when done. Like here, I set content = null; in JKLFile just to make sure it is gone. If you are wondering how big the map file it stopped on is, well, it managed to pass a 17 MB map file, but this one was only 5 MB.
As you can see these JKLFile objects are kept in this object (GobFile.java) for easy access later on, and the GobFile objects are kept in another class for later access, until I change directory (not implemented yet). But these objects shouldn't be too packed with stuff. Just a list of file names and various details.
Any suggestions on how I can find out where the memory is going or what objects are using the most resources? It would be nice to keep the details in memory rather than having to load the file again (up to 2 seconds) when I click on them from a list.
/Edward
Right now you are creating a new buffer on every loop iteration. Depending on your garbage collection, some of these may not get recycled/freed from memory in time, and you then run out. You can call the garbage collector using System.gc(), but there is no guarantee that it actually runs at any given time.
You could increase the memory allocated to the JVM on the command line when calling your program with the -Xmx and -Xms arguments. Also, if you are not on 32-bit Windows, ensure you are using a 64-bit JVM, as the 32-bit JVM has memory limitations that the 64-bit one does not. Run java -version from the command line to see the current system JVM.
You can move the declaration for your byte[] buffer outside your loop (above your while) so that it gets cleared and reused on each iteration.
Iterator iter = itemList.keySet().iterator();
numJKLs = 0;
byte[] buffer;
while (iter.hasNext()) {
    ...
    buffer = new byte[] ...
    ...
}
How big are the files you are handling? Do you really need to add every one of them to your levels variable? Could you make do with some file info or stats instead? How much RAM has your JVM currently got allocated?
To look at the current memory usage at any given time you will need to debug your program as it runs. Look up how to do that for your IDE of choice and you will be able to follow any Object's current state through the program's lifespan.
Here is my code:
public void mapTrace(String Path) throws FileNotFoundException, IOException {
    FileReader arq = new FileReader(new File(Path));
    BufferedReader leitor = new BufferedReader(arq, 41943040);
    Integer page;
    String std;
    Integer position = 0;

    while ((std = leitor.readLine()) != null) {
        position++;
        page = Integer.parseInt(std, 16);
        LinkedList<Integer> values = map.get(page);
        if (values == null) {
            values = new LinkedList<>();
            map.put(page, values);
        }
        values.add(position);
    }

    for (LinkedList<Integer> referenceList : map.values()) {
        Collections.reverse(referenceList);
    }
}
This is the HashMap structure
Map<Integer, LinkedList<Integer>> map = new HashMap<>();
For 50 MB - 100 MB trace files I don't have any problem, but for bigger files I get:
Exception in thread "AWT-EventQueue-0" java.lang.OutOfMemoryError: GC overhead limit exceeded
I don't know whether the reverse method is increasing the memory use, whether a LinkedList uses more space than another List structure, or whether the way I'm adding the lists to the map takes more space than it should. Can anyone tell me what's using so much space?
Can anyone tell me what's using so much space?
The short answer is that it is probably the space overheads of the data structure you have chosen that is using the space.
By my reckoning, a LinkedList<Integer> on a 64 bit JVM uses about 48 bytes of storage per integer in the list including the integers themselves.
By my reckoning, a Map<?, ?> on a 64 bit machine will use in the region of 48 bytes of storage per entry excluding the space need to represent the key and the value objects.
Now, your trace size estimates are rather too vague for me to plug the numbers in, but I'd expect a 1.5 GB trace file to need a LOT more than 2 GB of heap.
Given the numbers you've provided, a reasonable rule-of-thumb is that a trace file will occupy roughly 10 times its file size in heap memory ... using the data structure that you are currently using.
You don't want to configure a JVM to try to use more memory than the physical RAM available. Otherwise, you are liable to push the machine into thrashing ... and the operating system is liable to start killing processes. So for an 8 GB machine, I wouldn't advise going over -Xmx8g.
Putting that together, with an 8 GB machine you should be able to cope with a 600 MB trace file (assuming my estimates are correct), but a 1.5 GB trace file is not feasible. If you really need to handle trace files that big, my advice would be to either:
design and implement custom collection types for your specific use-case that use memory more efficiently (a rough sketch follows after this list),
rethink your algorithms so that you don't need to hold the entire trace files in memory, or
get a bigger machine.
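As an illustration of the first option, here is a rough sketch (not a drop-in replacement for your code): storing the positions for each page in a growable int[] avoids both the boxed Integer objects and the per-node overhead of LinkedList, which by the estimates above is where most of the space goes.

import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Compact list of int positions: about 4 bytes per entry once trimmed, versus
// roughly 48 bytes per entry for a LinkedList<Integer> node plus its Integer.
class IntList {
    private int[] data = new int[8];
    private int size = 0;

    void add(int value) {
        if (size == data.length) {
            data = Arrays.copyOf(data, data.length * 2); // grow geometrically
        }
        data[size++] = value;
    }

    int[] toArray() {
        return Arrays.copyOf(data, size); // trimmed copy
    }
}

// Usage mirroring the question's map, with the heavy value type swapped out.
class TraceIndex {
    final Map<Integer, IntList> map = new HashMap<Integer, IntList>();

    void record(int page, int position) {
        IntList values = map.get(page);
        if (values == null) {
            values = new IntList();
            map.put(page, values);
        }
        values.add(position);
    }
}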
I did some tests before reading your comment: I put -Xmx14g and processed the 600 MB file; it took some minutes (about 10) but it did fine.
The -Xmx14g option sets the maximum heap size. Based on the observed behaviour, I expect that the JVM didn't need anywhere like that much memory ... and didn't request it from the OS. And if you'd looked at memory usage in the task manager, I expect you'd have seen numbers consistent with that.
Then I put -Xmx18g and tried to process the 1.5 GB file, and it's been running for about 20 minutes. My memory in the task manager is going from 7.80 to 7.90. I wonder if this will finish. How could I use MORE memory than I have? Does it use the HD as virtual memory?
Yes, that is what it does.
Yes, each page of your process's virtual address space corresponds to a page on the hard disc.
If you've got more virtual pages than physical memory pages, at any given time some of those virtual memory pages will live on disk only. When your application tries to use one of those non-resident pages, the VM hardware generates an interrupt, and the operating system finds an unused physical page, populates it from the disc copy, and then hands control back to your program. But if your application is busy, the OS may have had to free that physical page by evicting another page, and that may have involved writing the contents of the evicted page to disc.
The net result is that when you try to use significantly more virtual address pages than you have physical memory, the application generates lots of interrupts that result in lots of disc reads and writes. This is known as thrashing. If your system thrashes too badly, it will spend most of its time waiting for disc reads and writes to finish, and performance will drop dramatically. And on some operating systems, the OS will attempt to "fix" the problem by killing processes.
Further to Stephen's quite reasonable answer, everything has its limit and your code simply isn't scalable.
In case where the input is "large" (as in your case), the only reasonable approach is a stream based approach, which while (usually) more complicated to write, uses very little memory/resources. Essentially you hold in memory only what you need to process the current task then release it asap.
You may find that unix command line tools are your best weapon, perhaps using a combination of awk, sed, grep etc to massage your raw data into hopefully a usable "end format".
I once stopped a colleague from writing a Java program to read in and parse XML and issue insert statements to a database: I showed him how to use a series of piped commands to produce executable SQL, which was then piped directly into the database command line tool. It took about 30 minutes to get it right, but the job was done. And the file was massive, so in Java it would have required a SAX parser and JDBC, which aren't fun.
To build this structure, I would put that data in a key/value datastore like Berkeley DB Java Edition.
Pseudo-code:
void putData(Database db, int page, int value)
{
    DatabaseEntry key = new DatabaseEntry();
    DatabaseEntry data = new DatabaseEntry();
    List<Integer> L = new LinkedList<Integer>();
    IntegerBinding.intToEntry(page, key);
    if (db.get(null, key, data, LockMode.DEFAULT) == OperationStatus.SUCCESS)
    {
        // decode the existing list of positions stored under this page
        TupleInput t = new TupleInput(data.getData());
        int n = t.readInt();
        for (int i = 0; i < n; ++i) L.add(t.readInt());
    }
    L.add(value);
    // re-encode the list with the new position appended
    TupleOutput out = new TupleOutput();
    out.writeInt(L.size());
    for (int v : L) out.writeInt(v);
    data = new DatabaseEntry(out.toByteArray());
    db.put(null, key, data);
}