I'm trying to read a large text corpus into memory with Java. At some point it hits a wall and just garbage collects interminably. I'd like to know if anyone has experience beating Java's GC into submission with large data sets.
I'm reading an 8 GB file of English text, in UTF-8, with one sentence to a line. I want to split() each line on whitespace and store the resulting String arrays in an ArrayList<String[]> for further processing. Here's a simplified program that exhibits the problem:
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;

/** Load whitespace-delimited tokens from stdin into memory. */
public class LoadTokens {
    private static final int INITIAL_SENTENCES = 66000000;

    public static void main(String[] args) throws IOException {
        List<String[]> sentences = new ArrayList<String[]>(INITIAL_SENTENCES);
        BufferedReader stdin = new BufferedReader(new InputStreamReader(System.in));
        long numTokens = 0;
        String line;
        while ((line = stdin.readLine()) != null) {
            String[] sentence = line.split("\\s+");
            if (sentence.length > 0) {
                sentences.add(sentence);
                numTokens += sentence.length;
            }
        }
        System.out.println("Read " + sentences.size() + " sentences, " + numTokens + " tokens.");
    }
}
Seems pretty cut-and-dried, right? You'll notice I even pre-size my ArrayList; I have a little less than 66 million sentences and 1.3 billion tokens. Now if you whip out your Java object sizes reference and your pencil, you'll find that should require about:
66e6 String[] references # 8 bytes ea = 0.5 GB
66e6 String[] objects # 32 bytes ea = 2 GB
66e6 char[] objects # 32 bytes ea = 2 GB
1.3e9 String references # 8 bytes ea = 10 GB
1.3e9 Strings # 44 bytes ea = 53 GB
8e9 chars # 2 bytes ea = 15 GB
83 GB. (You'll notice I really do need to use 64-bit object sizes, since Compressed OOPs can't help me with > 32 GB heap.) We're fortunate to have a RedHat 6 machine with 128 GB RAM, so I fire up my Java HotSpot(TM) 64-bit Server VM (build 20.4-b02, mixed mode) from my Java SE 1.6.0_29 kit with pv giant-file.txt | java -Xmx96G -Xms96G LoadTokens just to be safe, and kick back while I watch top.
Somewhere less than halfway through the input, at about 50-60 GB RSS, the parallel garbage collector kicks up to 1300% CPU (16 proc box) and read progress stops. Then it goes a few more GB, then progress stops for even longer. It fills up 96 GB and ain't done yet. I've let it go for an hour and a half, and it's just burning ~90% system time doing GC. That seems extreme.
To make sure I wasn't crazy, I whipped up the equivalent Python (all two lines ;) and it ran to completion in about 12 minutes and 70 GB RSS.
So: am I doing something dumb? (Aside from the generally inefficient way things are being stored, which I can't really help -- and even if my data structures are fat, as long as they fit, Java shouldn't just suffocate.) Is there magic GC advice for really large heaps? I did try -XX:+UseParNewGC and it seems even worse.
-XX:+UseConcMarkSweepGC: finishes in 78 GB and ~12 minutes. (Almost as good as Python!) Thanks for everyone's help.
Idea 1
Start by considering this:
while ((line = stdin.readLine()) != null) {
It at least used to be the case that readLine would return a String with a backing char[] of at least 80 characters. Whether or not that becomes a problem depends on what the next line does:
String[] sentence = line.split("\\s+");
You should determine whether the strings returned by split keep the same backing char[].
If they do (and assuming your lines are often shorter than 80 characters) you should use:
line = new String(line);
This will create a copy of the string with a "right-sized" backing char array.
If they don't, then you should potentially work out some way of creating the same behaviour but changing it so they do use the same backing char[] (i.e. they're substrings of the original line) - and do the same cloning operation, of course. You don't want a separate char[] per word, as that'll waste far more memory than the spaces.
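For illustration, a minimal sketch of that applied to the loop in the question, assuming (as on the Java 6 build used here) that split() returns substrings that share the line's backing char[]:

while ((line = stdin.readLine()) != null) {
    line = new String(line);                  // re-wrap: drop readLine's oversized buffer
    String[] sentence = line.split("\\s+");   // tokens now share one right-sized char[]
    if (sentence.length > 0) {
        sentences.add(sentence);
        numTokens += sentence.length;
    }
}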
Idea 2
Your title talks about the poor performance of lists - but of course you can easily take the list out of the equation here by simply creating a String[][], at least for test purposes. It looks like you already know the size of the file - and if you don't, you could run it through wc to check beforehand. Just to see if you can avoid that problem to start with.
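For instance, a test-only sketch along these lines (SENTENCE_COUNT is an assumed constant taken from a prior wc -l on the file):

String[][] sentences = new String[SENTENCE_COUNT][];
int count = 0;
String line;
while ((line = stdin.readLine()) != null) {
    sentences[count++] = line.split("\\s+");   // no List, no resizing
}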
Idea 3
How many distinct words are there in your corpus? Have you considered keeping a HashSet<String> and adding each word to it as you come across it? That way you're likely to end up with far fewer strings. At this point you would probably want to abandon the "single backing char[] per line" from the first idea - you'd want each string to be backed by its own char array, as otherwise a line with a single new word in is still going to require a lot of characters. (Alternatively, for real fine-tuning, you could see how many "new words" there are in a line and clone each string or not.)
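A rough sketch of that deduplication, using a HashMap as a string pool (the names pool and canonical are mine; this assumes the vocabulary is far smaller than the 1.3 billion token count, and needs java.util.HashMap and java.util.Map imports):

Map<String, String> pool = new HashMap<String, String>();

while ((line = stdin.readLine()) != null) {
    String[] sentence = line.split("\\s+");
    for (int i = 0; i < sentence.length; i++) {
        String canonical = pool.get(sentence[i]);
        if (canonical == null) {
            canonical = new String(sentence[i]);  // give it its own right-sized char[]
            pool.put(canonical, canonical);
        }
        sentence[i] = canonical;
    }
    sentences.add(sentence);
}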
You should use the following tricks:
Help the JVM share identical tokens as a single String reference by interning each token before adding the sentence, i.e. sentence[i] = sentence[i].intern() (see the sketch after this list). See String.intern for details. As far as I know, it should also have the effect Jon Skeet spoke about: it cuts the big char arrays into small per-word pieces.
Use experimental HotSpot options to compact String and char[] implementations and related ones:
-XX:+UseCompressedStrings -XX:+UseStringCache -XX:+OptimizeStringConcat
With that much memory, you should also configure your system and the JVM to use large pages.
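As a sketch of the interning point from the first item above (note that on Java 6 the intern pool lives in PermGen, so size it accordingly, as discussed further below):

while ((line = stdin.readLine()) != null) {
    String[] sentence = line.split("\\s+");
    for (int i = 0; i < sentence.length; i++) {
        sentence[i] = sentence[i].intern();   // identical tokens share one String instance
    }
    sentences.add(sentence);
    numTokens += sentence.length;
}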
It is really difficult to improve performance by more than about 5% with GC tuning alone. You should first reduce your application's memory consumption by profiling it.
By the way, I wonder if you really need to keep the full content of the corpus in memory - I do not know what your code does next with all the sentences, but you should consider an alternative such as the Lucene indexing tool for counting words or extracting other information from your text.
You should check how your heap space is split into regions (PermGen, OldGen, Eden and the Survivor spaces) with VisualGC, which is now a plugin for VisualVM.
In your case, you probably want to shrink Eden and the Survivor spaces to enlarge the OldGen, so that the GC does not spin collecting a full OldGen...
To do so, you have to use advanced options like:
-XX:NewRatio=2 -XX:SurvivorRatio=8
Beware: these regions and their default allocation policy depend on the collector you use, so change one parameter at a time and measure again.
If all those Strings must live in memory for the whole lifetime of the JVM, it is a good idea to intern them into a PermGen sized large enough with -XX:MaxPermSize, and to avoid collection of that region with -Xnoclassgc.
I recommend enabling these debugging options (no overhead expected) and then posting the GC log so that we can get an idea of your GC activity.
-XX:+PrintGC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:verbosegc.log
Related
I get a heap space error on this line on another machine, but the program runs fine on my machine.
I can't change the settings of the other machine.
How can I solve this problem without using Scanner.java?
Is " " the correct argument to String.split for splitting the String into pieces on spaces?
[File:]
U 1 234.003 30 40 50 true
T 2 234.003 10 60 40 false
Z 3 17234.003 30 40 50 true
M 4 0.500 30 40 50 true
/* 1000000+ lines */
java.lang.OutOfMemoryError: Java heap space
at java.base/java.util.Arrays.copyOfRange(Arrays.java:3821)
at java.base/java.lang.StringLatin1.newString(StringLatin1.java:764)
at java.base/java.lang.String.substring(String.java:1908)
at java.base/java.lang.String.split(String.java:2326)
at java.base/java.lang.String.split(String.java:2401)
at project.FileR(Fimporter.java:99)
public static DataBase File(String filename) throws IOException {
    BufferedReader fs = new BufferedReader(new FileReader(filename), 64 * 1024);
    String line;
    String[] wrds;
    String A; int hash; double B; int C; int D; boolean E;
    DataBase DB = new DataBase();
    while (true) {
        line = fs.readLine();
        if (line == null) { break; }
        wrds = line.split(" "); /* this is line 99 in the error-message */
        hash = Integer.parseInt(wrds[1]);
        B = Double.parseDouble(wrds[2]);
        C = Integer.parseInt(wrds[3]);
        D = Integer.parseInt(wrds[4]);
        E = Boolean.parseBoolean(wrds[5]);
        // hash is hashcode for all values B C D E in DataBase DB
        DB.listB.put(hash, B);
        DB.listC.put(hash, C);
        DB.listD.put(hash, D);
        DB.listE.put(hash, E);
    }
    fs.close();
    return DB;
}
How can I solve this problem without using Scanner.java?
Scanner is not the issue.
If you are getting OOME's with this code, the most likely root cause is the following:
DB.listB.put(hash,B);
DB.listC.put(hash,C);
DB.listD.put(hash,D);
DB.listE.put(hash,E);
You appear to be loading all of your data into 4 maps. (You haven't shown us the relevant code ... but I am making an educated guess here.)
My second guess is that your input files are very large, and the amount of memory needed to hold them in the above data structures is simply too large for the "other" machine's heap.
The fact that the OOME's are occurring in a String.split call is not indicative of a problem in split per se. This is just the proverbial "straw that broke the camel's back". The root cause of the problem is in what you are doing with the data after splitting it.
Possible solutions / workarounds:
Increase the heap size on the "other" machine. If you haven't set the -Xmx or -Xms options, the JVM will use the default max heap size ... which is typically 1/4 of the physical memory.
Read the command documentation for the java command to understand what -Xmx and -Xms do and how to set them.
Use more memory efficient data structures:
Create a class to represent a tuple consisting of the B, C, D, E values, then replace the 4 maps with a single map of these tuples (see the sketch after this list).
Use a more memory efficient Map type.
Consider using a sorted array of tuples (including the hash) and using binary search to look them up.
Redesign your algorithms so that they don't need all of the data in memory at the same time; e.g. split the input into smaller files and process them separately. (This may not be possible ....)
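A rough sketch of the tuple idea mentioned above (Row and TupleDataBase are hypothetical names; the fields mirror the variables in the question):

import java.util.HashMap;
import java.util.Map;

// Hypothetical value class replacing the four parallel maps in DataBase.
final class Row {
    final double b;
    final int c;
    final int d;
    final boolean e;

    Row(double b, int c, int d, boolean e) {
        this.b = b; this.c = c; this.d = d; this.e = e;
    }
}

class TupleDataBase {
    final Map<Integer, Row> rows = new HashMap<Integer, Row>();
}

// In the read loop, instead of the four put() calls:
//   DB.rows.put(hash, new Row(B, C, D, E));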
If I'm not mistaken, you can allocate a larger heap when launching your jar file, e.g.:
java -Xmx256M -jar MyApp.jar
which means you can change those settings.
Then again, just increasing the heap size will not get rid of the problem: if the files get bigger, the chance of an OOM increases.
You could consider splitting the work into chunks: process only the first X lines, drop the references (so the objects can be garbage collected), and then process the next lines, as in the sketch below.
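A sketch of that batching approach, under the assumption that each batch of lines can be reduced to some aggregate and then discarded (BATCH_SIZE and processBatch are hypothetical placeholders for whatever your program actually does with the data):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class BatchedReader {
    private static final int BATCH_SIZE = 100000;

    public static void main(String[] args) throws IOException {
        BufferedReader in = new BufferedReader(new FileReader(args[0]));
        List<String> batch = new ArrayList<String>(BATCH_SIZE);
        String line;
        while ((line = in.readLine()) != null) {
            batch.add(line);
            if (batch.size() == BATCH_SIZE) {
                processBatch(batch);  // consume the lines, keep only aggregated results
                batch.clear();        // drop references so the lines can be collected
            }
        }
        if (!batch.isEmpty()) {
            processBatch(batch);
        }
        in.close();
    }

    private static void processBatch(List<String> lines) {
        // placeholder for the real per-batch work
    }
}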
I have to read a text file of 226 MB made like this:
0 25
1 1382
2 99
3 3456
4 921
5 1528
6 578
7 122
8 528
9 81
The first number is an index, the second a value. I want to load a vector of shorts by reading this file (8349328 positions), so I wrote this code:
Short[] docsofword = new Short[8349328];
br2 = new BufferedReader(new FileReader("TermOccurrenceinCollection.txt"));
ss = br2.readLine();
while (ss != null)
{
    docsofword[Integer.valueOf(ss.split("\\s+")[0])] = Short.valueOf(ss.split("\\s+")[1]); //[indexTerm] - numOccInCollection
    ss = br2.readLine();
}
br2.close();
It turns out that the entire load takes an incredible 4.2 GB of memory. I really don't understand why; I expected roughly a 15 MB vector.
Thanks for any answer.
There are multiple effects at work here.
First, you declared your array as type Short[] instead of short[]. The former is a reference type, meaning each value is wrapped into an instance of Short, carrying the overhead of a full-blown object (most likely 16 bytes instead of two). This also inflates each array slot from two bytes to the reference size (generally 4 or 8 bytes, depending on heap size and 32/64-bit VM). The minimum size you can expect for the fully populated array is thus approximately 8349328 x 20 = 160 MB.
Your reading code is also happily producing tons of garbage objects. You are again using a wrapper type (Integer) to index the array where a simple int would do; that's at least 16 bytes of garbage where it would be zero with int. String.split is another culprit: you force the compilation of two regular expressions per line, plus the creation of two throwaway string arrays. That's numerous short-lived objects becoming garbage for each line. All of that could be avoided with a few more lines of code.
So you have a relatively memory-hungry array, and lots of garbage. The garbage memory can be cleaned up, but the JVM decides when. The decision is based on the available maximum heap memory and the garbage collector parameters. If you supplied no arguments for either, the JVM will happily fill your machine's memory before it attempts to reclaim garbage.
TLDR: Inefficient reading code paired with no JVM parameters.
If the file is generated by you, write it with an ObjectOutputStream; that makes it very easy to read back.
As @Durandal said, change the code accordingly. I am giving sample code below.
short[] docsofword = new short[8349328];
BufferedReader br2 = new BufferedReader(new FileReader("TermOccurrenceinCollection.txt"));
String ss = br2.readLine();
int strIndex, index;
while (ss != null)
{
    strIndex = ss.indexOf(' ');
    index = Integer.parseInt(ss.substring(0, strIndex));
    docsofword[index] = Short.parseShort(ss.substring(strIndex + 1));
    ss = br2.readLine();
}
br2.close();
You can optimise even further: instead of indexOf() we can write our own method that walks the characters, accumulating the index as an integer until it hits the space, and then parses the rest of the string as the value, as in the sketch below.
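For example, a hand-rolled parser along those lines (a sketch only; it assumes each line is two non-negative integers separated by a single space with no trailing whitespace, as in the sample file):

// Parses one "index value" line without split(), substring() or boxing.
static void parseLine(String ss, short[] docsofword) {
    int i = 0;
    int index = 0;
    char c;
    while ((c = ss.charAt(i)) != ' ') {     // digits of the index
        index = index * 10 + (c - '0');
        i++;
    }
    i++;                                    // skip the single space
    int value = 0;
    while (i < ss.length()) {               // digits of the value
        value = value * 10 + (ss.charAt(i) - '0');
        i++;
    }
    docsofword[index] = (short) value;
}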
I'm writing a program in Java which has to make use of a large hash table; the bigger the hash table can be, the better (it's a chess program :P). Basically as part of my hash table I have a long[] array, a short[] array, and two byte[] arrays. All of them should be the same size. When I set my table size to ten million, however, it crashes and says "java heap out of memory". This makes no sense to me. Here's how I see it:
1 Long + 1 Short + 2 Bytes = 12 bytes
x 10,000,000 = 120,000,000 bytes
/ 1024 = 117187.5 kB
/ 1024 = 114.4 MB
Now, 114 MB of RAM doesn't seem like too much to me. My Mac has 4 GB of RAM in total, and I have an app called FreeMemory which shows how much RAM I have free, and it's around 2 GB while running this program. Also, I set the Java preferences to -Xmx1024m, so Java should be able to use up to a gig of memory. So why won't it let me allocate just 114 MB?
You predicted that it should use 114 MB and if I run this (on a windows box with 4 GB)
public static void main(String... args) {
    long used1 = memoryUsed();
    int Hash_TABLE_SIZE = 10000000;
    long[] pos = new long[Hash_TABLE_SIZE];
    short[] vals = new short[Hash_TABLE_SIZE];
    byte[] depths = new byte[Hash_TABLE_SIZE];
    byte[] flags = new byte[Hash_TABLE_SIZE];
    long used2 = memoryUsed() - used1;
    System.out.printf("%,d MB used%n", used2 / 1024 / 1024);
}

private static long memoryUsed() {
    return Runtime.getRuntime().totalMemory() - Runtime.getRuntime().freeMemory();
}
prints
114 MB used
I suspect you are doing something else which is the cause of your problem.
I am using Oracle HotSpot Java 7 update 10
You have not taken into account that each object reference also uses memory, along with other "hidden" overheads... we must also take alignment into account... a byte is not always a byte ;-)
Java Objects Memory Structure
How much memory is used by Java
To see how much memory is really in use, you can use a profiler:
visualvm
If you are using a standard HashMap (or similar from the JDK), each "long" is boxed/unboxed and really takes more than 8 bytes. You can use this as a base instead (it uses less memory):
NativeIntHashMap
From what I have read about BlueJ (serious technical information is almost impossible to find), the BlueJ VM quite likely does not support primitive types at all; your arrays would actually hold boxed primitives. BlueJ uses a subset of all Java features, with an emphasis on object orientation.
If that is the case, plus taking into consideration that performance and efficiency are quite low on BlueJ VM's list of priorities, you may actually be using quite a bit more memory than you think: a whole order of magnitude is quite imaginable.
I believe one approach would be to clean up the heap memory after each execution; one relevant link is here:
Java heap space out of memory
I've got a question about storing a huge number of Strings in application memory. I need to load from a file and store about 5 million lines, each at most 255 chars (URLs), but mostly ~50. From time to time I'll need to search for one of them. Is it possible to make this app run in ~1 GB of RAM?
Will
ArrayList<String> list = new ArrayList<String>();
work?
As far as I know, a String in Java is stored as UTF-16 internally, which gives me huge memory use. Is it possible to make such an array with Strings stored in a single-byte (ANSI) encoding?
This is console application run with parameters:
java -Xmx1024M -Xms1024M -jar "PServer.jar" nogui
The latest JVMs support -XX:+UseCompressedStrings by default, which stores strings that contain only ASCII as a byte[] internally.
Having several GB of text in a List isn't a problem, but it can take a while to load from disk (many seconds)
If the average URL is 50 chars which are ASCII, with 32 bytes of overhead per String, 5 M entries could use about 400 MB which isn't much for a modern PC or server.
A Java String is a full-blown object. This means that apart from the characters of the string themselves, there is other information to store in it (a pointer to the object's class, a counter with the number of pointers pointing to it, and some other infrastructure data). So an empty String already takes 45 bytes in memory (as you can see here).
Now you just have to add the maximum length of your strings and make some easy calculations to get the maximum memory of that list.
Anyway, I would suggest you load the strings as byte[] if you have memory issues. That way you can control the encoding and you can still do searches.
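A sketch of that byte[] approach, assuming the URLs are plain ASCII/ISO-8859-1 so one byte per character is enough (UrlStore is just a made-up name):

import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class UrlStore {
    private static final Charset LATIN1 = Charset.forName("ISO-8859-1");
    private final List<byte[]> urls = new ArrayList<byte[]>();

    void add(String url) {
        urls.add(url.getBytes(LATIN1));      // one byte per character
    }

    boolean contains(String url) {
        byte[] target = url.getBytes(LATIN1);
        for (byte[] candidate : urls) {      // linear scan; sort once and binary-search if too slow
            if (Arrays.equals(candidate, target)) {
                return true;
            }
        }
        return false;
    }
}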
Is there some reason you need to restrict it to 1 GB? If you want to search through them, you definitely don't want to swap to disk, but if the machine has more memory it makes sense to go higher than 1 GB.
If you have to search, use a SortedSet, not an ArrayList
I run into the following error when I try to store a large file in a string.
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:2882)
at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:100)
at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:515)
at java.lang.StringBuffer.append(StringBuffer.java:306)
at rdr2str.ReaderToString.main(ReaderToString.java:52)
As is evident, I am running out of heap space. Basically my program looks something like this.
FileReader fr = new FileReader(<filepath>);
sb = new StringBuffer();
char[] b = new char[BLKSIZ];
while ((n = fr.read(b)) > 0)
sb.append(b, 0, n);
fileString = sb.toString();
Can someone suggest why I am running into this heap space error? Thanks.
You are running out of memory because the way you've written your program, it requires storing the entire, arbitrarily large file in memory. You have 2 options:
You can increase the memory by passing command line switches to the JVM:
java -Xms<initial heap size> -Xmx<maximum heap size>
You can rewrite your logic so that it deals with the file data as it streams in, thereby keeping your program's memory footprint low.
I recommend the second option. It's more work but it's the right way to go.
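A minimal sketch of that second option, with processLine() standing in for whatever you currently do with fileString (a hypothetical placeholder; adapt it to your real per-line work):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class StreamingRead {
    public static void main(String[] args) throws IOException {
        BufferedReader reader = new BufferedReader(new FileReader(args[0]));
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                processLine(line);   // memory use stays bounded by one line at a time
            }
        } finally {
            reader.close();
        }
    }

    private static void processLine(String line) {
        // placeholder for the real per-line processing
    }
}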
EDIT: To determine your system's defaults for initial and max heap size, you can use this code snippet (which I stole from a JavaRanch thread):
public class HeapSize {
    public static void main(String[] args) {
        long kb = 1024;
        long heapSize = Runtime.getRuntime().totalMemory();
        long maxHeapSize = Runtime.getRuntime().maxMemory();
        System.out.println("Heap Size (KB): " + heapSize / kb);
        System.out.println("Max Heap Size (KB): " + maxHeapSize / kb);
    }
}
You allocate a small StringBuffer that gets longer and longer. Preallocate according to file size, and you will also be a LOT faster.
Note that Java strings are Unicode (two bytes per char) while the file likely is not, so in memory you use roughly twice the file's size.
Depending on VM (32 bit? 64 bit?) and the limits set (http://www.devx.com/tips/Tip/14688) you may simply not have enough memory available. How large is the file actually?
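For instance, the loop from the question with the buffer pre-sized from the file length (the length in bytes is an upper bound on the number of chars for single-byte encodings; this is only a sketch of the preallocation, keeping the question's <filepath> placeholder and BLKSIZ constant):

File file = new File("<filepath>");
FileReader fr = new FileReader(file);
StringBuffer sb = new StringBuffer((int) file.length());  // avoids repeated expandCapacity copies
char[] b = new char[BLKSIZ];
int n;
while ((n = fr.read(b)) > 0) {
    sb.append(b, 0, n);
}
fr.close();
String fileString = sb.toString();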
In the OP, your program is aborting while the StringBuffer is being expanded. You should preallocate it to the size you need, or at least close to it. When the StringBuffer must expand, it needs RAM for both the original capacity and the new capacity. As TomTom said, your file likely holds 8-bit characters, so it will be converted to 16-bit Unicode in memory and double in size.
The program has not even reached the next doubling yet: StringBuffer.toString() in Java 6 allocates a new String and the internal char[] is copied again (in some earlier versions of Java this was not the case). At the time of this copy you need double the heap space, so at that moment at least 4 times your actual file size (30 MB * 2 for byte-to-Unicode, then 60 MB * 2 for the toString() call = 120 MB). Once this method is finished, GC will clean up the temporary objects.
If you cannot increase the heap space for your program you will have some difficulty. You cannot take the "easy" route and just return a String. You can try to do this incrementally so that you do not need to worry about the file size (one of the best solutions).
Look at your web service code in the client. It may provide a way to use a different class other than String - perhaps a java.io.Reader, java.lang.CharSequence, or a special interface, like the SAX related org.xml.sax.InputSource. Each of these can be used to build an implementation class that reads from your file in chunks as the callers needs it instead of loading the whole file at once.
For instance, if your web service handling routines can take a CharSequence then (if they are written well) you can create a special handler that returns just one character at a time from the file, while buffering the input. See this similar question: How to deal with big strings and limited memory.
Kris has the answer to your problem.
You could also look at Apache Commons IO's FileUtils.readFileToString, which may be a bit more efficient.
Although this might not solve your problem, some small things you can do to make your code a bit better:
create your StringBuffer with an initial capacity the size of the String you are reading
close your filereader at the end: fr.close();
By default, Java starts with a very small maximum heap (64M on Windows at least). Is it possible you are trying to read a file that is too large?
If so you can increase the heap with the JVM parameter -Xmx256M (to set maximum heap to 256 MB)
I tried running a slightly modified version of your code:
public static void main(String[] args) throws Exception {
    FileReader fr = new FileReader("<filepath>");
    StringBuffer sb = new StringBuffer();
    char[] b = new char[1000];
    int n = 0;
    while ((n = fr.read(b)) > 0)
        sb.append(b, 0, n);
    String fileString = sb.toString();
    System.out.println(fileString);
}
on a small file (2 KB) and it worked as expected. You will need to set the JVM parameter.
Trying to read an arbitrarily large file into main memory in an application is bad design. Period. No amount of JVM settings adjustments/etc... are going to fix the core issue here. I recommend that you take a break and do some googling and reading about how to process streams in java - here's a good tutorial and here's another good tutorial to get you started.