I'm sort of new to Java, but I've been able to learn quite a bit quite quickly. There are still many methods that elude me, though. I'm writing a Java program that runs through a bunch of files and tests them thoroughly to see whether they are valid level pack files. Thanks to Matt Olenik and his GobFile class (which was only meant to extract files), I was able to figure out a good strategy for getting to the important parts of a level pack and its individual map files to determine various details quickly.
However, when testing it on 37 files (5 of which aren't level packs), it crashes after 35 files. The 35th file is a valid level pack. The error it gives is:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
Iterator iter = itemList.keySet().iterator();
numJKLs = 0;
while (iter.hasNext()) {
    String nextFile = (String) iter.next();
    if (nextFile.startsWith("jkl\\")) {
        System.out.println(((ItemInfo) itemList.get(nextFile)).length);
-->     byte[] buffer = new byte[((ItemInfo) itemList.get(nextFile)).length];
        gobRaf.seek(((ItemInfo) itemList.get(nextFile)).offset);
        gobRaf.read(buffer, 0, ((ItemInfo) itemList.get(nextFile)).length);
        String[] content = new String(buffer).toLowerCase().split("\n");
        levels.add(numJKLs, new JKLFile(theFile, nextFile, content));
        gametype = levels.get(numJKLs).getFileType();
        progress.setProgText(nextFile);
        buffer = null;
        content = null;
        numJKLs++;
    }
}
The arrow shows where the error is raised. The `JKLFile` class reads the content array for these important parts, but should (in theory) dispose of it when done. As here, I set content = null; in JKLFile just to make sure it is gone. If you are wondering how big the map file it stopped on is: it managed to pass a 17 MB map file, but this one was only 5 MB.
As you can see, these JKLFile objects are kept in this object (GobFile.java) for easy access later on, and the GobFile objects are kept in another class for later access, until I change directory (not implemented yet). But these objects shouldn't hold much: just a list of file names and various details.
Any suggestions on how I can find out where the memory is going, or which objects are using the most resources? It would be nice to keep the details in memory rather than having to load a file again (up to 2 seconds) whenever I click on it in a list.
/Edward
Right now you are creating a new buffer on every iteration of the loop. Depending on your garbage collection, some of these may not get freed in time, and you then run out of memory. You can request a collection with System.gc(), but there is no guarantee that it actually runs at any given time.
You could increase the memory allocated to the JVM by passing the -Xmx and -Xms arguments on the command line when starting your program. Also, unless you are on 32-bit Windows, make sure you are using a 64-bit JVM, as a 32-bit JVM has memory limitations that a 64-bit one does not. Run java -version from the command line to see the current system JVM.
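For example, to start the JVM with a 256 MB initial and 1 GB maximum heap (the jar name here is just a placeholder):

java -Xms256m -Xmx1024m -jar LevelPackTester.jar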
You can move the declaration for your byte[] buffer outside your loop (above your while) so that it gets cleared and reused on each iteration.
Iterator iter = itemList.keySet().iterator();
numJKLs = 0;
byte[] buffer;
while (iter.hasNext()) {
    ...
    buffer = new byte[] ...
    ...
}
How big are the files you are handling? Do you really need to add every one of them to your levels variable? Could you make do with some file info or stats instead? How much RAM does your JVM currently have allocated?
To look at the current memory usage at any given time, you will need to debug your program as it runs. Look up how to do that in your IDE of choice and you will be able to follow any object's state through the program's lifespan.
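If you just want a rough number printed from inside your own code, the standard Runtime API is enough (a minimal sketch; where you call it is up to you):

// Approximate heap usage: totalMemory() is what the JVM has reserved,
// freeMemory() is the unused part of that reservation.
Runtime rt = Runtime.getRuntime();
long usedMb = (rt.totalMemory() - rt.freeMemory()) / (1024 * 1024);
System.out.println("Used heap: " + usedMb + " MB of max "
        + rt.maxMemory() / (1024 * 1024) + " MB");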
In my project, we have a requirement to read a very large file in which each line has identifiers separated by a special character ("|"). Unfortunately I can't use parallelism, since I need to validate the last character of one line against the first character of the next line to decide whether or not it will be extracted. Anyway, the requirement is very simple: break the line into tokens, analyze them, and store only some of them in memory. The code is very simple, something like below:
final LineIterator iterator = FileUtils.lineIterator(file);
while (iterator.hasNext()) {
    final String[] tokens = iterator.nextLine().split("\\|");
    // process
}
But this little piece of code is very, very inefficient. The split() method generates too many temporary objects that are not being collected (as best explained here: http://chrononsystems.com/blog/hidden-evils-of-javas-stringsplit-and-stringr).
For comparison purposes: a 5 MB file was using around 35 MB of memory by the end of processing.
I tested some alternatives like:
Using a pre compiled pattern (Performance of StringTokenizer class vs. split method in Java)
Use Guava's Splitter (Java split String performances)
Optimize String storage (http://java-performance.info/string-packing-converting-characters-to-bytes/)
Use of optimized collections (http://blog.takipi.com/5-coding-hacks-to-reduce-gc-overhead)
But none of them proved efficient enough. Using JProfiler, I could see that the amount of memory used by temporary objects is too high (35 MB used, but only 15 MB is actually being used by valid objects).
Then I decided to run a simple test: after every 50,000 lines read, an explicit call to System.gc(). At the end of the process, memory usage had decreased from 35 MB to 16 MB. I tested many, many times, and always got the same result.
I know that invoking System.gc() is bad practice (as indicated in Why is it bad practice to call System.gc()?). But is there any other alternative in a scenario where the split() method could be invoked millions of times?
[UPDATE]
I used a 5 MB file only for test purposes; the system should process much larger files (500 MB ~ 1 GB).
The first and most important thing to say here is: don't worry about it. The JVM is consuming 35 MB of RAM because its configuration says that's a low enough amount. When its highly efficient GC algorithm decides it's time, it will sweep all those objects away, no problem.
If you really want to, you can invoke Java with memory-management options (e.g. java -Xmx64m) -- I suggest it's not worth doing unless you're running on very limited hardware.
However, if you really want to avoid allocating an array of String each time you process a line, there are many ways to do so.
One way is to use a StringTokenizer:
StringTokenizer st = new StringTokenizer(line, "|");
while (st.hasMoreTokens()) {
    process(st.nextToken()); // nextToken() returns a String; no array is ever built
}
You could also avoid consuming a line at a time: get your file as a stream, use a StreamTokenizer, and consume one token at a time that way.
Read the API docs for Scanner, BufferedInputStream, Reader -- there are lots of choices in this area, because you're doing something fundamental.
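For instance, a java.util.Scanner with a custom delimiter hands you one token at a time, so no per-line String[] is ever allocated. A minimal sketch; the file name and process method are placeholders, and note that, unlike split, adjacent delimiters yield no empty tokens:

import java.io.File;
import java.io.FileNotFoundException;
import java.util.Scanner;

public class TokenStream {
    public static void main(String[] args) throws FileNotFoundException {
        // Treat '|' and line breaks as delimiters; \R (any line break) needs Java 8+.
        try (Scanner sc = new Scanner(new File("data.txt"))) {
            sc.useDelimiter("\\||\\R");
            while (sc.hasNext()) {
                process(sc.next()); // one token at a time, no per-line array
            }
        }
    }

    static void process(String token) {
        // placeholder for the real validation/extraction logic
    }
}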
However, none of these will cause Java to GC sooner or more aggressively. If the JRE doesn't consider itself short of memory, it won't collect any garbage.
Try writing something like this:
public static void main(String[] args) {
    Random r = new Random(); // java.util.Random
    Integer x;
    while (true) {
        x = Integer.valueOf(r.nextInt()); // boxes a fresh Integer nearly every pass
    }
}
Run it and watch your JVM's heap size as it runs (put a sleep in the loop if the usage shoots up too quickly to see). Each time around the loop, Java creates what you call a 'temporary object' of type Integer. All of these stay on the heap until the GC decides it needs to clear them away. You'll see that it won't do this until usage reaches a certain level. But when it reaches that level, it will do a good job of ensuring that its limits are never exceeded.
You should adjust the way you analyze such situations. While the article about the regex compilation under the hood is correct in general, it doesn't apply here. When you look at the source code of String.split(String), you'll see that it just delegates to String.split(String,int), which has a special code path for patterns consisting of just one literal character, including escaped ones like your \|.
The only temporary object created within that code path is an ArrayList. The regex package is not involved at all; this fact might help you understand why precompiling a regex pattern did not improve performance here.
When you use a profiler and conclude that there are too many objects, you should also use it to find out what kinds of objects they are and where they originate, instead of guessing wildly.
But it's not clear why you are complaining at all. You can configure the JVM to use a certain maximum amount of memory. As long as that maximum has not been reached, the JVM simply does what you told it: it uses that memory rather than wasting CPU cycles to avoid using it. Where's the sense in not using the available memory?
I have a performance problem that I can't get my head around. I am writing a Java application that parses huge (> 20 million lines) text files and stores certain information in a Set.
I measure the performance in seconds per million lines. Since I need a lot of memory, I usually run the program with -Xmx6000m and -Xms4000m.
If I just run the program, it parses 1 million lines in about 6 seconds. However, I realized after some performance investigation that if I add this code before the actual parsing routine, performance increases to under 3 seconds per 1 million lines:
BufferedReader br = new BufferedReader(new FileReader("graphs.nt"));
HashMap<String, String> foo = new HashMap<String, String>();
String line;
while ((line = br.readLine()) != null) {
    foo.put(line, "foo");
}
foo = null;
br.close();
br = null;
The graphs.nt file is about 9 million lines long. The performance increase persists even if I do not set foo to null; that line is mainly there to demonstrate that the map is in fact not used by the program.
The rest of the code is completely unrelated. I use a parser from openrdf sesame to read a different (not the graphs.nt) file and store extracted information in a new HashSet, created by another object.
In the rest of the code, I create a Parser object, to which I pass a Handler object.
This really confuses me. My guess is that this somehow drives the JVM to allocate more memory for my program, which I can see hints of when I run top. Without the HashMap, it allocates about 1 GB of memory. If I initialize the HashMap, it allocates more than 2 GB.
My question is whether this sounds at all reasonable. Is it possible that creating such a big object causes more memory to be allocated for the program to use afterwards? Shouldn't -Xmx and -Xms control the memory allocation, or are there further arguments that may play a role here?
I am aware that this may seem like an odd question and that information is scarce, but this is all the information that I found related to the issue. If there is any more information that may be helpful, I am more than happy to provide it.
Memory and GC can definitely impact performance. If possible, you should run with -Xms equal to -Xmx to disable heap resizing and give the JVM plenty of room at the start. Your app could then exit before any major GC is needed.
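For example, reusing the sizes from your question (the class name is a placeholder):

java -Xms6000m -Xmx6000m MyParser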
Unless you go out of your way to make it otherwise, foo will eventually pass out of scope and be collected, even if you don't null the reference, and even if the method containing the above code is never exited. But it will have forced the heap to grow larger, and that reduces the relative overhead of GC.
(It would be an interesting experiment to reference "foo" at the end of your program, to keep it in scope.)
This sounds like file caching. Your graphs.nt file is probably cached in RAM either by the OS or by the JVM. The GC will allow memory consumption to go up for performance reasons; if you add a forced collection right after your preload, System.gc(), you'll be able to tell whether the caching happens in the JVM or in the OS.
I need to write a list of words to a file and then save the file to disk. Is one of the following two ways better than the other? The second one obviously uses more main memory, but is there a difference in speed?
(roughly sketched in Java)
// Option 1: write each word as soon as it is generated
PrintWriter out = new PrintWriter(new FileWriter("words.txt")); // FileWriter itself has no println
for (int i = 0; i < n; i++) {
    out.println(generateWord());
}
versus
// Option 2: collect all words in memory first, then write them out
List<String> listOfWords = new ArrayList<String>();
for (int i = 0; i < n; i++) {
    listOfWords.add(generateWord());
}
PrintWriter out = new PrintWriter(new FileWriter("words.txt"));
for (String word : listOfWords) {
    out.println(word);
}
These two methods you show are exactly the same in terms of disk usage efficiency.
When thinking about speed of disk writes, you must always take into account what kind of writer object you are using. There are many types of writer objects and each of them may behave differently when it comes to actual disk writes.
If the writer you are using is one of those that writes the exact data you tell it to, when you tell it to, then writing word by word is very inefficient. You should consider switching to another writer (a BufferedWriter, for example) or building a longer string before writing it.
In general, you should try to write data in chunks that fit the disk's chunk size.
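As a quick illustration of the BufferedWriter suggestion above (a sketch only; the file name and generateWord are stand-ins from the question):

try (PrintWriter out = new PrintWriter(new BufferedWriter(new FileWriter("words.txt")))) {
    for (int i = 0; i < n; i++) {
        out.println(generateWord()); // each println lands in the buffer, not on the disk
    }
}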
Between your code and the disk, you have a stack something like: Java library code, a virtual machine runtime, the C runtime library, the operating system file cache/virtual memory subsystem, the operating system I/O scheduler, a device driver and the physical disk firmware.
Just do the simplest thing possible unless profiling shows a problem. Several of those layers will already be tuned to handle buffering, batching and scheduling sequential writes since they're such a common use case.
From the FileWriter's standpoint, you are doing exactly the same thing in both examples, so clearly there cannot be any difference in file I/O. And, as you say, the first one's space complexity is O(1), as opposed to the second one's O(N).
Here is my code:
public void mapTrace(String path) throws FileNotFoundException, IOException {
    FileReader arq = new FileReader(new File(path));
    BufferedReader leitor = new BufferedReader(arq, 41943040); // 40 MB read buffer
    Integer page;
    String std;
    Integer position = 0;
    while ((std = leitor.readLine()) != null) {
        position++;
        page = Integer.parseInt(std, 16);
        LinkedList<Integer> values = map.get(page);
        if (values == null) {
            values = new LinkedList<>();
            map.put(page, values);
        }
        values.add(position);
    }
    for (LinkedList<Integer> referenceList : map.values()) {
        Collections.reverse(referenceList);
    }
}
This is the HashMap structure
Map<Integer, LinkedList<Integer>> map = new HashMap<>();
For 50 MB - 100 MB trace files I don't have any problems, but for bigger files I get:
Exception in thread "AWT-EventQueue-0" java.lang.OutOfMemoryError: GC overhead limit exceeded
I don't know if the reverse method is increasing the memory use, if the LinkedList uses more space than another List structure, or if the way I'm adding the list to the map takes more space than it should. Can anyone tell me what's using so much space?
Can anyone tell me what's using so much space?
The short answer is that it is probably the space overheads of the data structure you have chosen that is using the space.
By my reckoning, a LinkedList<Integer> on a 64 bit JVM uses about 48 bytes of storage per integer in the list including the integers themselves.
By my reckoning, a Map<?, ?> on a 64 bit machine will use in the region of 48 bytes of storage per entry excluding the space need to represent the key and the value objects.
Now, your trace size estimates are rather too vague for me to plug the numbers in, but I'd expect a 1.5 GB trace file to need a LOT more than 2 GB of heap.
Given the numbers you've provided, a reasonable rule of thumb is that a trace file will occupy roughly 10 times its file size in heap memory, using the data structure you currently have: each line is only a few bytes of hexadecimal text in the file, but becomes one LinkedList<Integer> entry at around 48 bytes.
You don't want to configure a JVM to try to use more memory than the physical RAM available. Otherwise, you are liable to push the machine into thrashing ... and the operating system is liable to start killing processes. So for an 8 GB machine, I wouldn't advise going over -Xmx8g.
Putting that together: with an 8 GB machine you should be able to cope with a 600 MB trace file (assuming my estimates are correct), but a 1.5 GB trace file is not feasible. If you really need to handle trace files that big, my advice would be to either:
design and implement custom collection types for your specific use-case that use memory more efficiently (see the sketch after this list),
rethink your algorithms so that you don't need to hold the entire trace file in memory, or
get a bigger machine.
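As a sketch of the first option (assuming, on my part, that positions fit in an int), a growable int[] per page avoids both the node overhead and the boxed-Integer overhead of LinkedList<Integer>:

import java.util.Arrays;

// Roughly 4 bytes per stored position once the array has grown,
// versus ~48 bytes per entry for LinkedList<Integer>.
final class IntList {
    private int[] values = new int[4];
    private int size;

    void add(int v) {
        if (size == values.length) {
            values = Arrays.copyOf(values, values.length * 2);
        }
        values[size++] = v;
    }

    int get(int i) { return values[i]; }

    int size() { return size; }
}

The map would then become a Map<Integer, IntList>, and you could iterate each list backwards instead of calling Collections.reverse.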
I did some tests before reading your comment: I set -Xmx14g and processed the 600 MB file. It took some minutes (about 10), but it finished fine.
The -Xmx14g option sets the maximum heap size. Based on the observed behaviour, I expect that the JVM didn't need anywhere near that much memory ... and didn't request it from the OS. If you'd looked at memory usage in the task manager, I expect you'd have seen numbers consistent with that.
Then I set -Xmx18g and tried to process the 1.5 GB file, and it's been running for about 20 minutes. Memory usage in the task manager goes from 7.80 to 7.90 GB. I wonder if this will ever finish. How can it use MORE memory than I have? Does it use the HD as virtual memory?
Yes, that is what it does.
Yes: each page of your process's virtual address space corresponds to a page on the hard disc.
If you've got more virtual pages than physical memory pages, at any given time some of those virtual memory pages will live on disk only. When your application tries to use one of those non-resident pages, the VM hardware generates an interrupt, and the operating system finds an unused physical page, populates it from the disc copy, and hands control back to your program. But if your application is busy, the OS may have had to free up that physical page by evicting another one. And that may have involved writing the contents of the evicted page to disc.
The net result is that when you try to use significantly more virtual address pages than you have physical memory, the application generates lots of interrupts that result in lots of disc reads and writes. This is known as thrashing. If your system thrashes too badly, it will spend most of its time waiting for disc reads and writes to finish, and performance will drop dramatically. On some operating systems, the OS will attempt to "fix" the problem by killing processes.
Further to Stephen's quite reasonable answer, everything has its limit and your code simply isn't scalable.
In cases where the input is "large" (as in your case), the only reasonable approach is a stream-based one which, while (usually) more complicated to write, uses very little memory. Essentially you hold in memory only what you need to process the current task, then release it as soon as possible.
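A minimal sketch of that idea, reusing the hex parsing from the question (the file names and output format are placeholders):

try (BufferedReader in = new BufferedReader(new FileReader("trace.txt"));
     PrintWriter out = new PrintWriter(new BufferedWriter(new FileWriter("positions.txt")))) {
    String line;
    int position = 0;
    while ((line = in.readLine()) != null) {
        position++;
        // emit "page position" immediately instead of accumulating it in a map
        out.println(Integer.parseInt(line, 16) + " " + position);
    }
}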
You may find that unix command line tools are your best weapon, perhaps using a combination of awk, sed, grep etc to massage your raw data into hopefully a usable "end format".
I once stopped a colleague from writing a Java program to read in and parse XML and issue insert statements to a database: I showed him how to use a series of piped commands to produce executable SQL, which was then piped directly into the database command-line tool. It took about 30 minutes to get right, but the job was done. And the file was massive, so in Java it would have required a SAX parser and JDBC, which aren't fun.
To build this structure, I would put the data in a key/value datastore like Berkeley DB Java Edition.
Pseudo-code:
void putData(Database db, int page, int value) {
    DatabaseEntry key = new DatabaseEntry();
    DatabaseEntry data = new DatabaseEntry();
    List<Integer> list = new LinkedList<Integer>();
    IntegerBinding.intToEntry(page, key);
    if (db.get(null, key, data, LockMode.DEFAULT) == OperationStatus.SUCCESS) {
        // decode the list previously stored under this page
        TupleInput in = new TupleInput(data.getData());
        int n = in.readInt();
        for (int i = 0; i < n; ++i) {
            list.add(in.readInt());
        }
    }
    list.add(value);
    // re-encode the extended list and write it back
    TupleOutput out = new TupleOutput();
    out.writeInt(list.size());
    for (int v : list) {
        out.writeInt(v);
    }
    data = new DatabaseEntry(out.toByteArray());
    db.put(null, key, data);
}
Last summer, I made a Java application that parses some PDF files and extracts the information they contain to store it in a SQLite database.
Everything was fine and I kept adding new files to the database every week or so without any problems.
Now I'm trying to improve my application's speed, and I wanted to see how it would fare if I parsed all the files from the last two years into a new database. That's when I started getting this error: OutOfMemoryError: Java heap space. I didn't get it before because I was only parsing about 25 new files per week, but it seems that parsing 1000+ files one after the other is a lot more demanding.
I partially solved the problem: I made sure to close my connection after every call to the database and the error went away, but at a huge cost. Parsing the files is now unbearably slow. As for my ResultSets and Statements / PreparedStatements, I'm already closing them after every call.
I guess there's something I don't understand about when I should close my connection and when I should keep reusing the same one. I thought that since auto-commit is on, it commits after every transaction (select, update, insert, etc.) and the connection releases the extra memory it was using. I'm probably wrong, since when I parse too many files I end up getting the error I mentioned.
An easy solution would be to close it after every x calls, but then again I wouldn't understand why, and I'd probably get the same error later on. Can anyone explain when I should be closing my connections (if at all, except when I'm done)? If I'm only supposed to do it when I'm done, then can someone explain how I'm supposed to avoid this error?
By the way, I didn't tag this as SQLite because I got the same error when I tried running my program on my online MySQL database.
Edit
As has been pointed out by Deco and Mavrav, maybe the problem isn't my Connection. Maybe it's the files, so I'm going to post the code I use to call the function that parses the files one by one:
public static void visitAllDirsAndFiles(File dir) {
    if (dir.isDirectory()) {
        String[] children = dir.list();
        for (int i = 0; i < children.length; i++) {
            visitAllDirsAndFiles(new File(dir, children[i]));
        }
    } else {
        try {
            // System.out.println("File: " + dir);
            BowlingFilesReader.readFile(dir, playersDatabase);
        } catch (Exception exc) {
            System.out.println("Other exception in file: " + dir);
        }
    }
}
So if I call the method with a directory, it recursively calls itself for each File object I create. The method then detects that an entry is a file and calls BowlingFilesReader.readFile(dir, playersDatabase).
The memory should be released when that method is done, I think?
Your first instinct about open result sets and connections was good, though maybe not entirely the cause. Let's start with your database connection first.
Database
Try using a database connection pooling library, such as the Apache Commons DBCP (BasicDataSource is a good place to start): http://commons.apache.org/dbcp/
You will still need to close your database objects, but this will keep things running smoothly on the database front.
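A minimal sketch of the setup (the driver class, URL, and pool size are assumptions for a local SQLite file):

import java.sql.Connection;
import java.sql.SQLException;
import org.apache.commons.dbcp.BasicDataSource;

public class Db {
    private static final BasicDataSource DS = new BasicDataSource();
    static {
        DS.setDriverClassName("org.sqlite.JDBC"); // assumption: SQLite driver on the classpath
        DS.setUrl("jdbc:sqlite:bowling.db");      // placeholder database file
        DS.setMaxActive(8);                       // DBCP 1.x; DBCP 2 renames this to setMaxTotal
    }

    public static Connection getConnection() throws SQLException {
        // close() on the returned connection hands it back to the pool
        return DS.getConnection();
    }
}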
JVM Memory
Increase the size of the memory you give to the JVM. You may do so by adding -Xmx and a memory amount after, such as:
-Xmx64m <- this would give the JVM 64 megs of memory to play with
-Xmx512m <- 512 megs
Be careful with your numbers, though, throwing more memory at the JVM will not fix memory leaks. You may use something like JConsole or JVisualVM (included in your JDK's bin/ folder) to observe how much memory you are using.
Threading
You may increase the speed of your operations by threading them out, assuming the operation you are performing to parse these records is threadable. But more information might be necessary to answer that question.
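If it is threadable, a fixed pool is the usual shape. This is only a sketch, and it assumes BowlingFilesReader and your database layer are thread-safe, which you would need to verify; 'files' stands in for whatever collection your directory walk produces:

ExecutorService pool = Executors.newFixedThreadPool(4); // pool size is a guess; tune it
for (final File f : files) {
    pool.submit(new Runnable() {
        public void run() {
            try {
                BowlingFilesReader.readFile(f, playersDatabase);
            } catch (Exception exc) {
                System.out.println("Other exception in file: " + f);
            }
        }
    });
}
pool.shutdown(); // then awaitTermination(...) if you need to block until all parses finish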
Hope this helps.
As happens with garbage collection, I don't think the memory would be immediately reclaimed for subsequent processes and threads, so we can't entirely put our eggs in that basket. To begin with, put all the files in one directory, not in child directories of the parent. Then load the files one by one by iterating like this:
File f = null;
for (int i = 0; i < children.length; i++) {
    f = new File(dir, children[i]);
    BowlingFilesReader.readFile(f, playersDatabase);
    f = null;
}
So we are invalidating the reference so that the File object is released and will be picked up in a subsequent GC. And to find the limits, test by increasing the number of files: start with 100, then 200, and so on, and we will know at what point the OutOfMemoryError is thrown.
Hope this helps.