I get a Java heap space error on this line on another machine, but the code runs fine on my machine.
I can't change the settings of the other machine.
How can I solve this problem without using Scanner?
Is " " the correct argument to String.split for splitting the line into pieces at the spaces?
[File:]
U 1 234.003 30 40 50 true
T 2 234.003 10 60 40 false
Z 3 17234.003 30 40 50 true
M 4 0.500 30 40 50 true
/* 1000000+ lines */
java.lang.OutOfMemoryError: Java heap space
at java.base/java.util.Arrays.copyOfRange(Arrays.java:3821)
at java.base/java.lang.StringLatin1.newString(StringLatin1.java:764)
at java.base/java.lang.String.substring(String.java:1908)
at java.base/java.lang.String.split(String.java:2326)
at java.base/java.lang.String.split(String.java:2401)
at project.FileR(Fimporter.java:99)
public static DataBase File(String filename) throws IOException {
    BufferedReader fs = new BufferedReader(new FileReader(filename), 64 * 1024);
    String line;
    String[] wrds;
    String A; int hash; double B; int C; int D; boolean E;
    DataBase DB = new DataBase();
    while (true) {
        line = fs.readLine();
        if (line == null) { break; }
        wrds = line.split(" "); /* this is line 99 in the error-message */
        hash = Integer.parseInt(wrds[1]);
        B = Double.parseDouble(wrds[2]);
        C = Integer.parseInt(wrds[3]);
        D = Integer.parseInt(wrds[4]);
        E = Boolean.parseBoolean(wrds[5]);
        // hash is the key for the values B, C, D, E in DataBase DB
        DB.listB.put(hash, B);
        DB.listC.put(hash, C);
        DB.listD.put(hash, D);
        DB.listE.put(hash, E);
    }
    fs.close();
    return DB;
}
How can I solve this problem without using Scanner?
Scanner is not the issue.
If you are getting OOME's with this code, the most likely root cause is the following:
DB.listB.put(hash,B);
DB.listC.put(hash,C);
DB.listD.put(hash,D);
DB.listE.put(hash,E);
You appear to be loading all of your data into 4 maps. (You haven't shown us the relevant code ... but I am making an educated guess here.)
My second guess is that your input files are very large, and the amount of memory needed to hold them in the above data structures is simply too large for the "other" machine's heap.
The fact that the OOME's are occurring in a String.split call is not indicative of a problem in split per se. This is just the proverbial "straw that broke the camel's back". The root cause of the problem is in what you are doing with the data after splitting it.
Possible solutions / workarounds:
Increase the heap size on the "other" machine. If you haven't set the -Xmx or -Xms options, the JVM will use the default max heap size ... which is typically 1/4 of the physical memory.
Read the documentation for the java command to understand what -Xmx and -Xms do and how to set them.
Use more memory efficient data structures:
Create a class to represent a tuple consisting of the B, C, D, E values. Then replace the 4 maps with a single map of these tuples (see the sketch after this list).
Use a more memory efficient Map type.
Consider using a sorted array of tuples (including the hash) and using binary search to look them up.
Redesign your algorithms so that they don't need all of the data in memory at the same time; e.g. split the input into smaller files and process them separately. (This may not be possible ....)
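For the tuple idea above, a minimal sketch might look like this; the Row and RowStore names are invented for illustration and are not part of the asker's DataBase class:

import java.util.HashMap;
import java.util.Map;

// Hypothetical value object replacing the four parallel maps.
final class Row {
    final double b;
    final int c;
    final int d;
    final boolean e;

    Row(double b, int c, int d, boolean e) {
        this.b = b; this.c = c; this.d = d; this.e = e;
    }
}

class RowStore {
    // One map from hash to Row instead of listB/listC/listD/listE.
    private final Map<Integer, Row> rows = new HashMap<Integer, Row>();

    void put(int hash, double b, int c, int d, boolean e) {
        rows.put(hash, new Row(b, c, d, e));
    }

    Row get(int hash) {
        return rows.get(hash);
    }
}

This halves (or better) the per-entry bookkeeping, because you pay for one map entry and one small object per row instead of four map entries and four boxed values.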
If I'm not mistaken, you can allocate a larger heap size when launching your jar file, e.g.:
java -Xmx256M -jar MyApp.jar
which means you can change those settings.
But then again, just increasing the heap size will not get rid of the problem: if the files get bigger, the chance of an OOM increases again.
You could consider splitting large files before processing: only process the first X lines, let the GC reclaim them (by nulling the references), and then process the next lines.
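A rough sketch of that chunked approach; the batch size and the processBatch method are placeholders for whatever processing you actually do:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

class ChunkedReader {
    // Hypothetical batch size; tune to what the target machine's heap can hold.
    private static final int BATCH_SIZE = 100000;

    static void processFile(String filename) throws IOException {
        BufferedReader in = new BufferedReader(new FileReader(filename));
        try {
            List<String> batch = new ArrayList<String>(BATCH_SIZE);
            String line;
            while ((line = in.readLine()) != null) {
                batch.add(line);
                if (batch.size() == BATCH_SIZE) {
                    processBatch(batch); // placeholder for the real work
                    batch.clear();       // drop references so the GC can reclaim them
                }
            }
            if (!batch.isEmpty()) {
                processBatch(batch);
            }
        } finally {
            in.close();
        }
    }

    private static void processBatch(List<String> lines) {
        // placeholder: parse and handle one chunk of lines here
    }
}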
Related
I have to read a 226 MB text file formatted like this:
0 25
1 1382
2 99
3 3456
4 921
5 1528
6 578
7 122
8 528
9 81
The first number is an index, the second a value. I want to load a vector of short values from this file (8349328 positions), so I wrote this code:
Short[] docsofword = new Short[8349328];
br2 = new BufferedReader(new FileReader("TermOccurrenceinCollection.txt"));
ss = br2.readLine();
while (ss != null)
{
    docsofword[Integer.valueOf(ss.split("\\s+")[0])] = Short.valueOf(ss.split("\\s+")[1]); //[indexTerm] - numOccInCollection
    ss = br2.readLine();
}
br2.close();
It turns out that the load takes an incredible 4.2 GB of memory. I really don't understand why; I expected roughly a 15 MB vector.
Thanks for any answer.
There are multiple effects at work here.
First, you declared your array as type Short[] instead of short[]. The former is a reference type, meaning each value is wrapped into an instance of Short, incurring the overhead of a full-blown object (most likely 16 bytes instead of two). This also inflates each array slot from two bytes to the reference size (generally 4 or 8 bytes, depending on heap size and 32/64-bit VM). The minimum size you can expect for the fully populated array is thus approximately: 8349328 x 20 = 160 MB.
Your reading code is happily producing tons of garbage objects: you are again using a wrapper type (Integer) to address the array where a simple int would do. That's at least 16 bytes of garbage where it would be zero with int. String.split is another culprit: you call it twice per line, forcing the compilation of two regular expressions and creating two throwaway String arrays (plus their contents) per line. That's numerous short-lived objects that become garbage for each line. All of that could be avoided with a few more lines of code.
So you have a relatively memory-hungry array, and lots of garbage. The garbage memory can be cleaned up, but the JVM decides when. The decision is based on the available maximum heap memory and the garbage collector parameters. If you supplied no arguments for either, the JVM will happily fill a large part of your machine's memory before it attempts to reclaim garbage.
TLDR: Inefficient reading code paired with no JVM parameters.
If the file is generated by you, consider using ObjectOutputStream/ObjectInputStream; it is a very easy way to write and read the data back.
As @Durandal said, change the code accordingly. I am giving sample code below.
short[] docsofword = new short[8349328];
br2 = new BufferedReader(new FileReader("TermOccurrenceinCollection.txt"));
ss = br2.readLine();
int strIndex, index;
while (ss != null)
{
    strIndex = ss.indexOf(' ');
    index = Integer.parseInt(ss.substring(0, strIndex));
    docsofword[index] = Short.parseShort(ss.substring(strIndex + 1));
    ss = br2.readLine();
}
br2.close();
You can optimise even further: instead of indexOf(), we can write our own method that walks the characters and parses the index as it goes; once it hits the space, we already have the index and know where the remaining string (the value) starts, so we parse that too. A sketch of that follows.
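Sketch of that hand-rolled parsing; the parseLine method name is made up for illustration, and it assumes each line really is two non-negative decimal numbers separated by a single space:

// Parses "<index> <value>" without split() and without substring garbage.
static void parseLine(String line, short[] docsofword) {
    int i = 0;
    int index = 0;
    // accumulate digits until the space
    while (line.charAt(i) != ' ') {
        index = index * 10 + (line.charAt(i) - '0');
        i++;
    }
    i++; // skip the space
    int value = 0;
    while (i < line.length()) {
        value = value * 10 + (line.charAt(i) - '0');
        i++;
    }
    docsofword[index] = (short) value;
}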
I try to build a map with the content of a file and my code is as below:
System.out.println("begin to build the sns map....");
String basePath = PropertyReader.getProp("oldbasepath");
String pathname = basePath + "\\user_sns.txt";
FileReader fr;
Map<Integer, List<Integer>> snsMap =
new HashMap<Integer, List<Integer>>(2000000);
try {
    fr = new FileReader(pathname);
    BufferedReader br = new BufferedReader(fr);
    String line;
    int i = 1;
    while ((line = br.readLine()) != null) {
        System.out.println("line number: " + i);
        i++;
        String[] strs = line.split("\t");
        int key = Integer.parseInt(strs[0]);
        int value = Integer.parseInt(strs[1]);
        List<Integer> list = snsMap.get(key);
        //if the follower is not in the map
        if (snsMap.get(key) == null)
            list = new LinkedList<Integer>();
        list.add(value);
        snsMap.put(key, list);
        System.out.println("map size: " + snsMap.size());
    }
} catch (IOException e) {
    e.printStackTrace();
}
System.out.println("finish building the sns map....");
return snsMap;
The program is very fast at first but gets much slower when the printed information looks like this:
map size: 1138338
line number: 30923602
map size: 1138338
line number: 30923603
....
I tried to find the reason with two System.out.println() calls, to judge the performance of the BufferedReader and the HashMap instead of using a Java profiler.
Sometimes it takes a while to print the map size after the line number has been printed, and sometimes it takes a while to print the line number after the map size has been printed. My question is: which makes my program slow, the BufferedReader reading a big file, or the HashMap holding a big map?
If you are testing this from inside Eclipse, you should be aware of the huge performance penalty of writing to stdout/stderr, due to Eclipse's capturing that output in the Console view. Printing inside a tight loop is always a performance issue, even outside of Eclipse.
But, if what you are complaining about is the slowdown experienced after processing 30 million lines, then I bet it's a memory issue. First it slows down due to intense GC'ing and then it breaks with OutOfMemoryError.
You will have to check your program with some profiling tool to understand why it is slow.
In general, file access is much slower than in-memory operations (unless you are constrained in memory and doing excessive GC), so the guess would be that reading the file is the slower part here.
Until you have profiled, you will not know what is slow and what isn't.
Most likely, the System.out will show up as being the bottleneck, and you'll then have to profile without them again. System.out is the worst thing you can do for finding performance bottlenecks, because in doing so you usually add an even worse bottleneck.
An obvious optimization for your code is to move the line
snsMap.put(key, list);
into the if statement. You only need to put this when you created a new list. Otherwise, the put will just replace the current value with itself.
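In other words, reusing the variables from your method, the middle of the loop would become something like:

List<Integer> list = snsMap.get(key);
// if the follower is not in the map yet, create and register its list exactly once
if (list == null) {
    list = new LinkedList<Integer>();
    snsMap.put(key, list);
}
list.add(value);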
Java cost associated with Integer objects (and in particular the use of Integers in the Java Collections API) is largely a memory (and thus Garbage Collection!) issue. You can sometimes get significant gains by using primitive collections such as GNU trove, depending how well you can adjust your code to use them efficiently. Most of the gains of Trove are in memory usage. Definitely try rewriting your code to use TIntArrayList and TIntObjectMap from GNU trove. I'd avoid linked lists, too, in particular for primitive types.
Roughly estimated, a HashMap<Integer, List<Integer>> needs at least 3*16 bytes per entry. The doubly linked list again needs at least 2*16 bytes per entry stored. 1m keys + 30m values ~ 1 GB. No overhead included yet. With GNU trove TIntObjectHash<TIntArrayList> that should be 4+4+16 bytes per key and 4 bytes per value, so 144 MB. The overhead is probably similar for both.
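For illustration, a sketch of that Trove-based structure might look like the following; the class and package names are as in GNU Trove 3.x, so treat them as an assumption to verify against the version you actually use:

import gnu.trove.list.array.TIntArrayList;
import gnu.trove.map.hash.TIntObjectHashMap;

class SnsStore {
    // Primitive-keyed map: no Integer boxing for the keys.
    private final TIntObjectHashMap<TIntArrayList> snsMap =
            new TIntObjectHashMap<TIntArrayList>(2000000);

    void add(int key, int value) {
        TIntArrayList list = snsMap.get(key);
        if (list == null) {
            list = new TIntArrayList();
            snsMap.put(key, list);
        }
        list.add(value); // primitive int storage, no per-element node objects
    }
}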
The reason that Trove uses less memory is because the types are specialized for primitive values such as int. They will store the int values directly, thus using 4 bytes to store each.
A Java collections HashMap consists of many objects. It roughly looks like this: there are Entry objects that point to a key and a value object each. These must be objects, because of the way generics are handled in Java. In your case, the key will be an Integer object, which uses 16 bytes (4 bytes mark, 4 bytes type, 4 bytes actual int value, 4 bytes padding) AFAIK. These are all 32 bit system estimates. So a single entry in the HashMap will probably need some 16 (entry) + 16 (Integer key) + 32 (yet empty LinkedList) bytes of memory that all need to be considered for garbage collection.
If you have lots of Integer objects, it just will take 4 times as much memory as if you could store everything using int primitives. This is the cost you pay for the clean OOP principles realized in Java.
The best way is to run your program with a profiler (for example, JProfiler) and see which parts are slow. Debug output can also slow your program down, for example.
HashMap is not slow; in fact it is among the fastest of the map implementations. Hashtable is the synchronized (thread-safe) one of the classic maps, and can be slow sometimes.
Important note: close the BufferedReader and the file after you read the data... this might help.
eg: br.close()
file.close()
Please check your system processes in the task manager; there may be too many processes running in the background.
Sometimes Eclipse is quite resource-heavy, so try running the program from the console to check.
I'm trying to read a large text corpus into memory with Java. At some point it hits a wall and just garbage collects interminably. I'd like to know if anyone has experience beating Java's GC into submission with large data sets.
I'm reading an 8 GB file of English text, in UTF-8, with one sentence to a line. I want to split() each line on whitespace and store the resulting String arrays in an ArrayList<String[]> for further processing. Here's a simplified program that exhibits the problem:
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;

/** Load whitespace-delimited tokens from stdin into memory. */
public class LoadTokens {
    private static final int INITIAL_SENTENCES = 66000000;

    public static void main(String[] args) throws IOException {
        List<String[]> sentences = new ArrayList<String[]>(INITIAL_SENTENCES);
        BufferedReader stdin = new BufferedReader(new InputStreamReader(System.in));
        long numTokens = 0;
        String line;
        while ((line = stdin.readLine()) != null) {
            String[] sentence = line.split("\\s+");
            if (sentence.length > 0) {
                sentences.add(sentence);
                numTokens += sentence.length;
            }
        }
        System.out.println("Read " + sentences.size() + " sentences, " + numTokens + " tokens.");
    }
}
Seems pretty cut-and-dried, right? You'll notice I even pre-size my ArrayList; I have a little less than 66 million sentences and 1.3 billion tokens. Now if you whip out your Java object sizes reference and your pencil, you'll find that should require about:
66e6 String[] references # 8 bytes ea = 0.5 GB
66e6 String[] objects # 32 bytes ea = 2 GB
66e6 char[] objects # 32 bytes ea = 2 GB
1.3e9 String references # 8 bytes ea = 10 GB
1.3e9 Strings # 44 bytes ea = 53 GB
8e9 chars # 2 bytes ea = 15 GB
83 GB. (You'll notice I really do need to use 64-bit object sizes, since Compressed OOPs can't help me with > 32 GB heap.) We're fortunate to have a RedHat 6 machine with 128 GB RAM, so I fire up my Java HotSpot(TM) 64-bit Server VM (build 20.4-b02, mixed mode) from my Java SE 1.6.0_29 kit with pv giant-file.txt | java -Xmx96G -Xms96G LoadTokens just to be safe, and kick back while I watch top.
Somewhere less than halfway through the input, at about 50-60 GB RSS, the parallel garbage collector kicks up to 1300% CPU (16 proc box) and read progress stops. Then it goes a few more GB, then progress stops for even longer. It fills up 96 GB and ain't done yet. I've let it go for an hour and a half, and it's just burning ~90% system time doing GC. That seems extreme.
To make sure I wasn't crazy, I whipped up the equivalent Python (all two lines ;) and it ran to completion in about 12 minutes and 70 GB RSS.
So: am I doing something dumb? (Aside from the generally inefficient way things are being stored, which I can't really help -- and even if my data structures are fat, as long as they fit, Java shouldn't just suffocate.) Is there magic GC advice for really large heaps? I did try -XX:+UseParNewGC and it seems even worse.
-XX:+UseConcMarkSweepGC: finishes in 78 GB and ~12 minutes. (Almost as good as Python!) Thanks for everyone's help.
Idea 1
Start by considering this:
while ((line = stdin.readLine()) != null) {
It at least used to be the case that readLine would return a String with a backing char[] of at least 80 characters. Whether or not that becomes a problem depends on what the next line does:
String[] sentence = line.split("\\s+");
You should determine whether the strings returned by split keep the same backing char[].
If they do (and assuming your lines are often shorter than 80 characters) you should use:
line = new String(line);
This will create a copy of the string with a "right-sized" backing char array.
If they don't, then you should potentially work out some way of creating the same behaviour but changing it so they do use the same backing char[] (i.e. they're substrings of the original line) - and do the same cloning operation, of course. You don't want a separate char[] per word, as that'll waste far more memory than the spaces.
Idea 2
Your title talks about the poor performance of lists - but of course you can easily take the list out of the equation here by simply creating a String[][], at least for test purposes. It looks like you already know the size of the file - and if you don't, you could run it through wc to check beforehand. Just to see if you can avoid that problem to start with.
Idea 3
How many distinct words are there in your corpus? Have you considered keeping a HashSet<String> and adding each word to it as you come across it? That way you're likely to end up with far fewer strings. At this point you would probably want to abandon the "single backing char[] per line" from the first idea - you'd want each string to be backed by its own char array, as otherwise a line with a single new word in is still going to require a lot of characters. (Alternatively, for real fine-tuning, you could see how many "new words" there are in a line and clone each string or not.)
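A minimal sketch of that canonicalisation idea; note it uses a HashMap rather than a HashSet so the canonical instance can be handed back, which is the same trick in practice (the TokenPool name is invented for illustration):

import java.util.HashMap;
import java.util.Map;

class TokenPool {
    // One shared String instance per distinct token.
    private final Map<String, String> pool = new HashMap<String, String>();

    String canonical(String token) {
        String existing = pool.get(token);
        if (existing == null) {
            // new String(...) drops any oversized backing array left over from readLine()/split()
            existing = new String(token);
            pool.put(existing, existing);
        }
        return existing;
    }
}

// usage inside the reading loop (sketch):
// for (int i = 0; i < sentence.length; i++) {
//     sentence[i] = pool.canonical(sentence[i]);
// }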
You should use the following tricks:
Help the JVM to collapse identical tokens into a single String reference by interning them, e.g. sentence[i] = sentence[i].intern() for each token before adding the array. See String.intern for details. As far as I know, it should also have the effect Jon Skeet spoke about: it cuts the big char array into small per-token pieces.
Use experimental HotSpot options to compact String and char[] implementations and related ones:
-XX:+UseCompressedStrings -XX:+UseStringCache -XX:+OptimizeStringConcat
With such memory amount, you should configure your system and JVM to use large pages.
It is really difficult to improve performance by more than about 5% with GC tuning alone. You should first reduce your application's memory consumption through profiling.
By the way, I wonder if you really need to get the full content of a book in memory - I do not know what your code does next with all sentences but you should consider an alternate option like Lucene indexing tool to count words or extracting any other information from your text.
You should check how your heap space is split into parts (PermGen, OldGen, Eden and Survivors) with VisualGC, which is now a plugin for VisualVM.
In your case, you probably want to reduce Eden and the Survivor spaces to increase the OldGen, so that your GC does not spin collecting a full OldGen...
To do so, you have to use advanced options like:
-XX:NewRatio=2 -XX:SurvivorRatio=8
Beware: these zones and their default allocation policy depend on the collector you use. So change one parameter at a time and check again.
If all those Strings have to live in memory for the whole JVM lifetime, it is a good idea to intern them into a PermGen made large enough with -XX:MaxPermSize, and to avoid collection of that zone with -Xnoclassgc.
I recommend you enable these debugging options (no overhead expected) and eventually post the GC log so that we can get an idea of your GC activity:
-XX:+PrintGC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:verbosegc.log
I am trying to read from a txt file (a book) and then add every line of it to a LinkedList. However, when I run the code, I get an OutOfMemoryError at l.add(line);. Could you tell me what I am doing wrong with this code? Or is there a better way to store the String values than a LinkedList?
Thanks a lot!
public Book (String bookname) throws java.io.IOException {
    f = new FileReader(bookname);
    b = new BufferedReader(f);
    l = new LinkedList<String>();
    String line = b.readLine();
    while (line != null) {
        l.add(line);
    }
    b.close();
}
As others point out, you have created an infinite, memory-consuming loop. A common idiom for reading from a BufferedReader is:
String line;
while ( ( line = b.readLine() ) != null) {
l.add(line);
}
I guess it is possible that the content of the book is just too large to fit into memory all at once anyway. You can increase the memory available to the JVM by using the -Xmx argument, e.g.:
java -Xmx1G MyClass
The default value for this is 64 MB, which isn't much these days.
You are adding the same line over and over, until memory runs out:
String line = b.readLine();
while (line != null) {
l.add(line);
}
See? The line variable is read once outside the loop, and never changes within the loop.
Probably you should replace
while (line != null) {
l.add(line);
}
with
while ((line = b.readLine()) != null) {
l.add(line);
}
The while loop never quits, because the variable line never becomes null (it is never reassigned inside the loop). Try this:
String line = "";
while ((line = b.readLine())!= null)
{
l.add(line);
}
b.close();
Quite simply, the memory required to store the strings (and everything else in your program) exceeded the total free memory available on the heap.
Other lists will have slightly different amounts of overhead, but realistically the memory requirements of the structure itself is likely to be insignificant compared to the data it contains. In other words, switching to a different list implementation might let you read a few more lines before falling over, but it won't solve the problem.
If you haven't increased the heap space of the java application, it might be running with fairly low defaults. In which case, you should consider providing the following command-line argument to your invocation of java:
-Xmx512m
(where the 512m implies 512 megabytes of heap space; you could use e.g. -Xmx2g or whatever else you feel is appropriate.)
On the other hand, if you're already running with a large heap (much larger than the total size of the Strings you want to hold in memory), this could point to a memory problem somewhere else. How many characters are there in the book? It will take at least twice that many bytes to store all of the lines, and probably 20% or so more than that to account for overhead. If your calculations indicate that your current heap space should be able to hold all of this data, there might be some trouble elsewhere. Otherwise, you now know what you'd need to increase your heap to as a minimum.
(As an aside, trying to process large amounts of input as a single batch can often run into memory issues - what if you want to process an 8GB text file? Often it's better to process smaller chunks sequentially, in some sort of stream. For example, if you wanted to uppercase every character and write it back out to a different file, you could do this a line at a time rather than reading the whole book into memory first.)
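A minimal sketch of that line-at-a-time style; the file names here are placeholders, not anything from the original question:

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;

class UppercaseCopy {
    public static void main(String[] args) throws IOException {
        BufferedReader in = new BufferedReader(new FileReader("book.txt"));
        BufferedWriter out = new BufferedWriter(new FileWriter("book-upper.txt"));
        try {
            String line;
            while ((line = in.readLine()) != null) {
                // only one line is held in memory at a time
                out.write(line.toUpperCase());
                out.newLine();
            }
        } finally {
            in.close();
            out.close();
        }
    }
}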
I agree with mjg123. And be careful with the OutOfMemoryError; I'd suggest you have a look at this blog for more details on how to handle such situations:
Click here
I run into the following error when I try to store a large file in a String.
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:2882)
at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:100)
at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:515)
at java.lang.StringBuffer.append(StringBuffer.java:306)
at rdr2str.ReaderToString.main(ReaderToString.java:52)
As is evident, I am running out of heap space. Basically my program looks something like this:
FileReader fr = new FileReader(<filepath>);
sb = new StringBuffer();
char[] b = new char[BLKSIZ];
while ((n = fr.read(b)) > 0)
    sb.append(b, 0, n);
fileString = sb.toString();
Can someone suggest why I am running into this heap space error? Thanks.
You are running out of memory because the way you've written your program, it requires storing the entire, arbitrarily large file in memory. You have 2 options:
You can increase the memory by passing command line switches to the JVM:
java -Xms<initial heap size> -Xmx<maximum heap size>
You can rewrite your logic so that it deals with the file data as it streams in, thereby keeping your program's memory footprint low.
I recommend the second option. It's more work but it's the right way to go.
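If, for example, the downstream code only needs derived information rather than the whole String, a sketch of that streaming style could look like this; the file path and the "count characters" task are placeholders for your real processing:

import java.io.FileReader;
import java.io.IOException;

class StreamingCount {
    public static void main(String[] args) throws IOException {
        FileReader fr = new FileReader("big-file.txt"); // placeholder path
        char[] b = new char[8192];
        long chars = 0;
        int n;
        try {
            while ((n = fr.read(b)) > 0) {
                chars += n; // process each block here instead of appending it to a buffer
            }
        } finally {
            fr.close();
        }
        System.out.println("Read " + chars + " characters.");
    }
}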
EDIT: To determine your system's defaults for initial and max heap size, you can use this code snippet (which I stole from a JavaRanch thread):
public class HeapSize {
    public static void main(String[] args) {
        long kb = 1024;
        long heapSize = Runtime.getRuntime().totalMemory();
        long maxHeapSize = Runtime.getRuntime().maxMemory();
        System.out.println("Heap Size (KB): " + heapSize/1024);
        System.out.println("Max Heap Size (KB): " + maxHeapSize/1024);
    }
}
You allocate a small StringBuffer that gets longer and longer. Preallocate according to file size, and you will also be a LOT faster.
Note that Java strings are Unicode internally, while the file content likely is not, so you use... twice the size in memory.
Depending on VM (32 bit? 64 bit?) and the limits set (http://www.devx.com/tips/Tip/14688) you may simply not have enough memory available. How large is the file actually?
In the OP, your program is aborting while the StringBuffer is being expanded. You should preallocate it to the size you need, or at least close to it. When the StringBuffer must expand, it needs RAM for both the original capacity and the new capacity. As TomTom said too, your file likely contains 8-bit characters, so it will be converted to 16-bit Unicode in memory and double in size.
The program has not yet even encountered the next doubling: StringBuffer.toString() in Java 6 will allocate a new String and the internal char[] will be copied again (in some earlier versions of Java this was not the case). At the time of this copy you will need double the heap space, so at that moment you need at least 4 times your actual file size (30 MB * 2 for byte->Unicode, then 60 MB * 2 for the toString() call = 120 MB). Once this method is finished, GC will clean up the temporary objects.
If you cannot increase the heap space for your program you will have some difficulty. You cannot take the "easy" route and just return a String. You can try to do this incrementally so that you do not need to worry about the file size (one of the best solutions).
Look at your web service code in the client. It may provide a way to use a different class other than String - perhaps a java.io.Reader, java.lang.CharSequence, or a special interface, like the SAX related org.xml.sax.InputSource. Each of these can be used to build an implementation class that reads from your file in chunks as the callers needs it instead of loading the whole file at once.
For instance, if your web service handling routes can take a CharSequence then (if they are written well) you can create a special handler to return just one character at a time from the file - but buffer the input. See this similar question: How to deal with big strings and limited memory.
Kris has the answer to your problem.
You could also look at Apache Commons IO's FileUtils.readFileToString, which may be a bit more efficient.
Although this might not solve your problem, some small things you can do to make your code a bit better:
create your StringBuffer with an initial capacity the size of the String you are reading
close your filereader at the end: fr.close();
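Put together, those two small suggestions look roughly like this; the sizing is only an approximation (characters are not bytes), and the readFile helper name is made up for illustration:

import java.io.File;
import java.io.FileReader;
import java.io.IOException;

class ReadWithPresizedBuffer {
    static String readFile(String path) throws IOException {
        File file = new File(path);
        // Pre-size the buffer from the file length; this over-allocates slightly for
        // multi-byte encodings but avoids repeated expandCapacity copies.
        StringBuffer sb = new StringBuffer((int) file.length());
        FileReader fr = new FileReader(file);
        try {
            char[] b = new char[8192];
            int n;
            while ((n = fr.read(b)) > 0) {
                sb.append(b, 0, n);
            }
        } finally {
            fr.close(); // close the reader at the end, as suggested above
        }
        return sb.toString();
    }
}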
By default, Java starts with a very small maximum heap (64M on Windows at least). Is it possible you are trying to read a file that is too large?
If so you can increase the heap with the JVM parameter -Xmx256M (to set maximum heap to 256 MB)
I tried running a slightly modified version of your code:
public static void main(String[] args) throws Exception {
    FileReader fr = new FileReader("<filepath>");
    StringBuffer sb = new StringBuffer();
    char[] b = new char[1000];
    int n = 0;
    while ((n = fr.read(b)) > 0)
        sb.append(b, 0, n);
    String fileString = sb.toString();
    System.out.println(fileString);
}
on a small file (2 KB) and it worked as expected. You will need to set the JVM parameter.
Trying to read an arbitrarily large file into main memory in an application is bad design. Period. No amount of JVM settings adjustment etc. is going to fix the core issue here. I recommend that you take a break and do some googling and reading about how to process streams in Java - here's a good tutorial and here's another good tutorial to get you started.