I am trying to build a map from the contents of a file. My code is below:
System.out.println("begin to build the sns map....");
String basePath = PropertyReader.getProp("oldbasepath");
String pathname = basePath + "\\user_sns.txt";
FileReader fr;
Map<Integer, List<Integer>> snsMap =
new HashMap<Integer, List<Integer>>(2000000);
try {
fr = new FileReader(pathname);
BufferedReader br = new BufferedReader(fr);
String line;
int i = 1;
while ((line = br.readLine()) != null) {
System.out.println("line number: " + i);
i++;
String[] strs = line.split("\t");
int key = Integer.parseInt(strs[0]);
int value = Integer.parseInt(strs[1]);
List<Integer> list = snsMap.get(key);
//if the follower is not in the map
if(snsMap.get(key) == null)
list = new LinkedList<Integer>();
list.add(value);
snsMap.put(key, list);
System.out.println("map size: " + snsMap.size());
}
} catch (IOException e) {
e.printStackTrace();
}
System.out.println("finish building the sns map....");
return snsMap;
The program is very fast at first but becomes much slower by the time it prints the following:
map size: 1138338
line number: 30923602
map size: 1138338
line number: 30923603
....
I am trying to find the reason by using the two System.out.println() calls, rather than a Java profiler, to judge the performance of BufferedReader and HashMap.
Sometimes there is a pause between the line number and the map size being printed, and sometimes the pause comes after the map size, before the next line number. My question is: what makes my program slow, the BufferedReader on a big file or the HashMap holding a big map?
If you are testing this from inside Eclipse, you should be aware of the huge performance penalty of writing to stdout/stderr, due to Eclipse capturing that output in the Console view. Printing inside a tight loop is always a performance issue, even outside of Eclipse.
But, if what you are complaining about is the slowdown experienced after processing 30 million lines, then I bet it's a memory issue. First it slows down due to intense GC'ing and then it breaks with OutOfMemoryError.
You will have to check your program with a profiling tool to understand why it is slow.
In general, file access is much slower than in-memory operations (unless you are constrained in memory and doing excessive GC), so my guess would be that reading the file is the slow part here.
Until you have profiled, you will not know what is slow and what isn't.
Most likely, the System.out will show up as being the bottleneck, and you'll then have to profile without them again. System.out is the worst thing you can do for finding performance bottlenecks, because in doing so you usually add an even worse bottleneck.
An obvious optimization for your code is to move the line
snsMap.put(key, list);
into the if statement. You only need to put this when you created a new list. Otherwise, the put will just replace the current value with itself.
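A minimal sketch of that suggestion, assuming Java 8 or later: computeIfAbsent creates and stores the list only the first time a key is seen. The class and method names (SnsMapBuilder, build) are made up for this illustration, and ArrayList is used here instead of the question's LinkedList.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class, not part of the original program.
final class SnsMapBuilder {
    // Reads tab-separated "key<TAB>value" lines and groups the values by key.
    static Map<Integer, List<Integer>> build(BufferedReader reader) throws IOException {
        Map<Integer, List<Integer>> snsMap = new HashMap<>();
        String line;
        while ((line = reader.readLine()) != null) {
            String[] strs = line.split("\t");
            int key = Integer.parseInt(strs[0]);
            int value = Integer.parseInt(strs[1]);
            // The list is created and put into the map only on the first
            // occurrence of a key; later lines just append to the existing list.
            snsMap.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
        }
        return snsMap;
    }
}
```

This removes the redundant put for keys that already exist, but it does not change the memory cost of the boxed Integer objects.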
Java cost associated with Integer objects (and in particular the use of Integers in the Java Collections API) is largely a memory (and thus Garbage Collection!) issue. You can sometimes get significant gains by using primitive collections such as GNU trove, depending how well you can adjust your code to use them efficiently. Most of the gains of Trove are in memory usage. Definitely try rewriting your code to use TIntArrayList and TIntObjectMap from GNU trove. I'd avoid linked lists, too, in particular for primitive types.
Roughly estimated, a HashMap<Integer, List<Integer>> needs at least 3*16 bytes per entry. The doubly linked list needs at least another 2*16 bytes per stored value. 1m keys + 30m values ~ 1 GB, with no overhead included yet. With a GNU Trove TIntObjectHashMap<TIntArrayList>, that should be 4+4+16 bytes per key and 4 bytes per value, so 144 MB. The overhead is probably similar for both.
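The arithmetic behind those two figures can be written out as a small sketch. The per-key and per-value byte counts are simply the rough estimates quoted above, not measured values, and the class name MemoryEstimate is hypothetical.

```java
// Hypothetical calculator that only restates the rough estimate above.
final class MemoryEstimate {
    // HashMap<Integer, List<Integer>> with linked lists:
    // 3*16 bytes per key entry, 2*16 bytes per stored value.
    static long javaCollectionsBytes(long keys, long values) {
        return keys * (3 * 16) + values * (2 * 16);
    }

    // Primitive collections: 4+4+16 bytes per key, 4 bytes per stored value.
    static long primitiveCollectionsBytes(long keys, long values) {
        return keys * (4 + 4 + 16) + values * 4;
    }

    public static void main(String[] args) {
        long keys = 1_000_000L;
        long values = 30_000_000L;
        System.out.println(javaCollectionsBytes(keys, values));      // 1008000000 (~1 GB)
        System.out.println(primitiveCollectionsBytes(keys, values)); // 144000000 (144 MB)
    }
}
```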
The reason that Trove uses less memory is because the types are specialized for primitive values such as int. They will store the int values directly, thus using 4 bytes to store each.
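To illustrate the idea of storing int values directly, here is a minimal hand-rolled growable list backed by an int[]. It is only a sketch of the principle and is not Trove's actual implementation; the class name IntList and the initial capacity of 8 are assumptions.

```java
import java.util.Arrays;

// Hypothetical minimal growable list of primitive ints (not Trove's code).
final class IntList {
    private int[] data = new int[8];
    private int size;

    void add(int value) {
        if (size == data.length) {
            // Grow the backing array; each element still costs only 4 bytes.
            data = Arrays.copyOf(data, data.length * 2);
        }
        data[size++] = value;
    }

    int get(int index) {
        if (index < 0 || index >= size) {
            throw new IndexOutOfBoundsException("index " + index + ", size " + size);
        }
        return data[index];
    }

    int size() {
        return size;
    }
}
```

No Integer object is allocated per element, so there is nothing extra for the garbage collector to trace beyond the single array.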
A Java collections HashMap consists of many objects. It roughly looks like this: there are Entry objects that point to a key and a value object each. These must be objects, because of the way generics are handled in Java. In your case, the key will be an Integer object, which uses 16 bytes (4 bytes mark, 4 bytes type, 4 bytes actual int value, 4 bytes padding) AFAIK. These are all 32 bit system estimates. So a single entry in the HashMap will probably need some 16 (entry) + 16 (Integer key) + 32 (yet empty LinkedList) bytes of memory that all need to be considered for garbage collection.
If you have lots of Integer objects, it just will take 4 times as much memory as if you could store everything using int primitives. This is the cost you pay for the clean OOP principles realized in Java.
The best way is to run your program with a profiler (for example, JProfiler) and see which parts are slow. Debug output can also slow your program down.
HashMap is not slow; it is generally among the fastest of the standard map implementations. Hashtable is synchronized for thread safety, which can make it slower.
Important note: close the BufferedReader and the FileReader after you have read the data; this might help.
e.g.: br.close()
fr.close()
Please check your system processes in Task Manager; there may be too many processes running in the background.
Sometimes Eclipse is really resource heavy, so try running the program from the console to check.
I'm trying to load a CSV file with a huge number of lines (>5 million), but it slows down massively when I try to process them all into an ArrayList of each value.
I've tried a few different variations of reading and removing from the input list I loaded from the file, but it still ends up running out of heap space, even when I allocate 14 GB to the process, while the file is only 2 GB.
I know I need to be removing values so that I don't end up with duplicate references in memory, i.e. an ArrayList of lines and also an ArrayList of the individual comma-separated values, but I have no idea how to do something like that.
Edit : For reference, in this particular situation, data should end up containing 16 * 5 million values.
If there's a more elegant solution, I'm all for it.
The intention when loading this file is to process it as a database, with the appropriate methods like select and select where, all handled by a sheet class. It worked just fine with my smaller sample file of 36k lines, but I guess it doesn't scale very well.
Current code :
//Load method to load it from file
private static CSV loadCSV(String filename, boolean absolute)
{
String fullname = "";
if (!absolute)
{
fullname = baseDirectory + filename;
if (!Load.exists(fullname,false))
return null;
}
else if (absolute)
{
fullname = filename;
if (!Load.exists(fullname,false))
return null;
}
ArrayList<String> output = new ArrayList<String>();
AtomicInteger atomicInteger = new AtomicInteger(0);
try (Stream<String> stream = Files.lines(Paths.get(fullname)))
{
stream.forEach(t -> {
output.add(t);
atomicInteger.getAndIncrement();
if (atomicInteger.get() % 10000 == 0)
{
Log.log("Lines done " + output.size());
}
});
CSV c = new CSV(output);
return c;
}
catch (IOException e)
{
Log.log("Error reading file " + fullname,3,"FileIO");
e.printStackTrace();
}
return null;
}
//Process method inside CSV class
public CSV(List<String> output)
{
Log.log("Inside csv " + output.size());
ListIterator<String> iterator = output.listIterator();
while (iterator.hasNext())
{
ArrayList<String> d = new ArrayList<String>(Arrays.asList(iterator.next().split(splitter,-1)));
data.add(d);
iterator.remove();
}
}
You need to use a database that provides the functionality required for your task (select, group).
Any database can efficiently read and aggregate 5 million rows.
Don't try to use "operations on ArrayList"; that works well only on small datasets.
I think some key concepts are missing here:
You said the file size is 2GB. That does not mean that when you load that file data in an ArrayList, the size in memory would also be 2GB. Why? Usually files store data using UTF-8 character encoding, whereas JVM internally stores String values using UTF-16. So, assuming your file contains only ASCII characters, each character occupies 1 byte in the filesystem whereas 2 bytes in memory. Assuming (for the sake of simplicity) all String values are unique, there will be space required to store the String references which are 32 bits each (assuming a 64 bits system with compressed oop). How much is your heap (excluding other memory areas)? How much is your eden space and old space? I'll come back to this again shortly.
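The difference between the two encodings can be checked with a small sketch. The class name EncodingSize and the sample field are made up for illustration. It only counts the character data, not object headers or references, and note that newer JVMs (Java 9 and later) may store Latin-1-only strings more compactly than plain UTF-16.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper comparing encoded sizes of the same text.
final class EncodingSize {
    // Bytes needed for the text when encoded as UTF-8 (typical on disk).
    static int utf8Bytes(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    // Bytes needed for the same text as UTF-16 code units (no byte-order mark).
    static int utf16Bytes(String text) {
        return text.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        String field = "12345,abc";
        System.out.println(utf8Bytes(field));  // 9
        System.out.println(utf16Bytes(field)); // 18
    }
}
```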
In your code, you don't specify the ArrayList size. This is a blunder in this case. Why? The JVM creates a small ArrayList. After some time it sees that the code keeps pumping in data, so it creates a bigger backing array and copies the data of the old array into the new one. This event has deeper implications when you are dealing with such a huge volume of data: firstly, both the old and the new arrays (with millions of entries) are in memory simultaneously, occupying space; secondly, data is copied unnecessarily from one array to another, not once or twice but repeatedly, every time the array runs out of space. What happens to the old array? It is discarded and needs to be garbage collected. So these repeated array copies and garbage collections slow down the process. The CPU is really working hard here. What happens when your data no longer fits into the young generation (which is smaller than the heap)? You may need to observe the behaviour using something like JVisualVM.
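To get a feeling for how often the backing array is reallocated, here is a small simulation. It assumes the array grows by roughly 50% each time it is full, which matches common OpenJDK behaviour but is an implementation detail; the class name GrowthCount is hypothetical.

```java
// Hypothetical simulation of backing-array growth; no real list is allocated.
final class GrowthCount {
    // Counts how many times an array starting at initialCapacity would have to
    // be reallocated and copied before it can hold the given number of elements.
    static int resizesNeeded(int initialCapacity, int elements) {
        long capacity = initialCapacity;
        int resizes = 0;
        while (capacity < elements) {
            capacity += Math.max(1, capacity >> 1); // grow by about 50%
            resizes++;
        }
        return resizes;
    }
}
```

When the approximate number of lines is known in advance, passing it to the constructor, as in new ArrayList<>(expectedLines), avoids these intermediate copies altogether.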
All in all, what I mean to say is there are good number of reasons why a 2GB file fills up your much larger heap and why your process performance is poor.
I would have a method that took a line read from the file as parameter and split it into a list of strings and then returned that list. I would then add that list to the CSV object in the file reading loop. That would mean only one large collection instead of two and the read lines could be freed from memory quicker.
Something like this
CSV csv = new CSV();
try (Stream<String> stream = Files.lines(Paths.get(fullname))) {
stream.forEach(t -> {
List<String> splittedString = splitFileRow(t);
csv.add(splittedString);
});
}
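A self-contained variant of the same idea might look like the sketch below. The names CsvRows, splitFileRow and parse are placeholders, the separator is treated as a literal string rather than a regular expression, and a BufferedReader is used so the method works on any input source.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical stand-in for the CSV class in the question.
final class CsvRows {
    // Splits one line into its fields, keeping trailing empty fields.
    static List<String> splitFileRow(String line, String separator) {
        return Arrays.asList(line.split(Pattern.quote(separator), -1));
    }

    // Reads line by line and keeps only the split rows, never the raw lines,
    // so each raw line becomes garbage as soon as it has been split.
    static List<List<String>> parse(BufferedReader reader, String separator) throws IOException {
        List<List<String>> rows = new ArrayList<>();
        String line;
        while ((line = reader.readLine()) != null) {
            rows.add(splitFileRow(line, separator));
        }
        return rows;
    }
}
```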
Trying to solve this problem using pure Java is overwhelming. I suggest using a processing engine like Apache Spark, which can process the file in a distributed way by increasing the level of parallelism.
Apache Spark has specific APIs to load CSV file:
spark.read.format("csv").option("header", "true").load("../Downloads/*.csv")
You can transform it into an RDD, or Dataframe and perform operations on it.
You can find more online, or here
I work with text files with short strings in it (10 digits). The size of file is approx 1.5Gb, so the number of rows is reaching 100 millions.
Every day I get another file and need to extract new elements (tens of thousands a day).
What's the best approach to solve my problem?
I tried loading the data into ArrayLists. That takes around 20 seconds for each file, but the subtraction of the lists takes forever.
I use this code:
dataNew.removeAll(dataOld);
Tried to load data in HashSets - creation of HashSets is endless.
The same with LinkedHashset.
Tried to load into ArrayLists and to sort only one of them
Collections.sort(dataNew);
but it didn't speed up the process of
dataNew.removeAll(dataOld);
Also memory consumption is rather high - sort() finishes only with heap of 15Gb (13Gb is not enough).
I've also tried the good old Linux utility diff, and it finished the task in 76 minutes (while eating 8Gb of RAM).
So, my goal is to solve the problem in Java within 1 hour of processing time (or less, of course) and with consumption of 15Gb (or better 8-10Gb).
Any suggestions, please?
Maybe I need not alphabetic sorting of ArrayList, but something else?
UPDATE:
This is a country-wide list of invalid passports. It is published as a global list, so I need to extract delta by myself.
Data is unsorted and each row is unique. So I must compare 100M elements with 100M elements. Dataline is for example, "2404,107263". Converting to integer is not possible.
Interesting, when I increased maximum heap size to 16Gb
java -Xms5G -Xmx16G -jar utils.jar
loading to HashSet became fast (50 seconds for first file), but program gets killed by system Out-Of-Memory killer, as it eats enormous amounts of RAM while loading second file to second HashSet or ArrayList
My code is very simple:
List<String> setL = Files.readAllLines(Paths.get("filename"));
HashSet<String> dataNew = new HashSet<>(setL);
on second file the program gets
Killed
[1408341.392872] Out of memory: Kill process 20538 (java) score 489 or sacrifice child
[1408341.392874] Killed process 20531 (java) total-vm:20177160kB, anon-rss:16074268kB, file-rss:0kB
UPDATE2:
Thanks for all your ideas!
Final solution is: converting lines to Long + using fastutil library (LongOpenHashSet)
RAM consumption became 3.6Gb and processing time only 40 seconds!
Interesting observation: starting Java with default settings made loading 100 million Strings into the JDK's native HashSet endless (I interrupted it after 1 hour), while starting with -Xmx16G sped the loading up to 1 minute. Memory consumption was ridiculous (around 20Gb), but the processing speed was rather fine: 2 minutes.
If you are not limited by RAM, the native JDK HashSet is not so bad in terms of speed.
p.s. Maybe the task is not clearly explained, but I do not see any opportunity not to load at least one file entirely. So, I doubt memory consumption can be further lowered by much.
First of all, don't call Files.readAllLines(Paths.get("filename")) and then pass everything to a Set; that holds unnecessarily huge amounts of data. Try to hold as few lines as possible at all times.
Read the files line-by-line and process as you go. This immediately cuts your memory usage by a lot.
Set<String> oldData = new HashSet<>();
try (BufferedReader reader = Files.newBufferedReader(Paths.get("oldData"))) {
for (String line = reader.readLine(); line != null; line = reader.readLine()) {
// process your line, maybe add to the Set for the old data?
oldData.add(line);
}
}
Set<String> newData = new HashSet<>();
try (BufferedReader reader = Files.newBufferedReader(Paths.get("newData"))) {
for (String line = reader.readLine(); line != null; line = reader.readLine()) {
// Is it enough just to remove from old data so that you'll end up with only the difference between old and new?
boolean oldRemoved = oldData.remove(line);
if (!oldRemoved) {
newData.add(line);
}
}
}
You'll end up with two sets containing only the data that is present in the old, or the new dataset, respectively.
Second of all, try to presize your containers if at all possible. Their size (usually) doubles when they reach their capacity, and that could potentially create a lot of overhead when dealing with big collections.
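As a rough sketch of presizing, the helper below computes an initial capacity for a HashSet from the expected number of elements, assuming the default load factor of 0.75. The names Presize, capacityFor and newPresizedSet are hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper for presizing a HashSet.
final class Presize {
    // Initial capacity that lets a HashSet with the default load factor (0.75)
    // hold the expected number of elements without rehashing.
    static int capacityFor(int expectedElements) {
        return (int) Math.ceil(expectedElements / 0.75);
    }

    static Set<String> newPresizedSet(int expectedElements) {
        return new HashSet<>(capacityFor(expectedElements));
    }
}
```

For 100 million lines this asks for a very large table up front, so the heap still has to be big enough; presizing only avoids the repeated rehashing on the way there.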
Also, if your data are numbers, you could just use a long and hold that instead of trying to hold instances of String? There's a lot of collection libraries that enable you to do this, e.g. Koloboke, HPPC, HPPC-RT, GS Collections, fastutil, Trove. Even their collections for Objects might serve you very well as a standard HashSet has a lot of unnecessary object allocation.
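One possible way to turn such a line into a long is sketched below. It assumes every line has the fixed shape shown in the question ("2404,107263": a 4-digit series, a comma, a 6-digit number); that format, and the names PassportKey, encode and decode, are assumptions for illustration.

```java
import java.util.Locale;

// Hypothetical encoder; assumes the fixed "dddd,dddddd" line format.
final class PassportKey {
    // Packs the two numeric parts into a single long value.
    static long encode(String line) {
        int comma = line.indexOf(',');
        long series = Long.parseLong(line.substring(0, comma));
        long number = Long.parseLong(line.substring(comma + 1));
        return series * 1_000_000L + number;
    }

    // Restores the original text, including leading zeros.
    static String decode(long key) {
        return String.format(Locale.ROOT, "%04d,%06d", key / 1_000_000L, key % 1_000_000L);
    }
}
```

The resulting values can then be kept in a long[] or in a primitive long set from one of the libraries listed above, instead of a HashSet<String>.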
Split each string into two parts and call intern() on whichever part (str1 or str2) is repeated most, to avoid storing duplicates of the same String in the heap. Here I call intern() on both parts just to show the idea, but don't use it on a part unless its values repeat often.
Set<MyObj> lineData = new HashSet<MyObj>();
String line = null;
BufferedReader bufferedReader = new BufferedReader(new FileReader(file.getAbsoluteFile()));
while((line = bufferedReader.readLine()) != null){
String[] data = line.split(",");
MyObj myObj = new MyObj();
myObj.setStr1(data[0].intern());
myObj.setStr2(data[1].intern());
lineData.add(myObj);
}
public class MyObj {
private String str1;
private String str2;
public String getStr1() {
return str1;
}
public void setStr1(String str1) {
this.str1 = str1;
}
@Override
public int hashCode() {
final int prime = 31;
int result = 1;
result = prime * result + ((str1 == null) ? 0 : str1.hashCode());
result = prime * result + ((str2 == null) ? 0 : str2.hashCode());
return result;
}
@Override
public boolean equals(Object obj) {
if (this == obj)
return true;
if (obj == null)
return false;
if (getClass() != obj.getClass())
return false;
MyObj other = (MyObj) obj;
if (str1 == null) {
if (other.str1 != null)
return false;
} else if (!str1.equals(other.str1))
return false;
if (str2 == null) {
if (other.str2 != null)
return false;
} else if (!str2.equals(other.str2))
return false;
return true;
}
public String getStr2() {
return str2;
}
public void setStr2(String str2) {
this.str2 = str2;
}
}
Use a database; to keep things simple, use a Java-embedded database (Derby, HSQL, H2, ...). With that much information, you can really benefit from standard DB caching, time-efficient storage, and querying. Your pseudo-code would be:
if first use,
define new one-column table, setting column as primary-key
iterate through input records, for each:
insert record into table
otherwise
open database with previous records
iterate through input records, for each:
lookup record in DB, update/report as required
Alternatively, you can do even less work if you use existing "table-diff" libraries, such as DiffKit - from their tutorial:
java -jar ../diffkit-app.jar -demoDB
Then configure a connection to this demo database within your favorite
JDBC enabled database browser
[...]
Your DB browser will show you the tables TEST10_LHS_TABLE and
TEST10_RHS_TABLE (amongst others) populated with the data values from
the corresponding CSV files.
That is: DiffKit does essentially what I proposed above, loading files into database tables (they use embedded H2) and then comparing these tables through DB queries.
They accept input as CSV files; but conversion from your textual input to their CSV can be done in a streaming fashion in less than 10 lines of code. And then you just need to call their jar to do the diff, and you would get the results as tables in their embedded DB.
I made a very simple spell checker, just checking if a word was in the dictionary was too slow for whole documents. I created a map structure, and it works great.
Map<String, List<String>> dictionary;
For the key, I use the first 2 letters of the word. The list has all the words that start with the key. To speed it up a bit more you can sort the list, then use a binary search to check for existence. I'm not sure the optimum length of key, and if your key gets too long you could nest the maps. Eventually it becomes a tree. A trie structure is possibly the best actually.
You can use a trie data structure for such cases: http://www.toptal.com/java/the-trie-a-neglected-data-structure
The algorithm would be as follows:
Read the old file line by line and store each line in the trie.
Read the new file line by line and test each line whether it is
in the trie: if it is not, then it is a newly added line.
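The two steps above can be sketched with a very small digit-only trie. This version skips every character that is not a digit (such as the comma), which is only safe if all lines share the same fixed format; that is an assumption. The class name DigitTrie is hypothetical, and the node layout is chosen for clarity rather than minimal memory use.

```java
// Hypothetical minimal trie over the digits '0'-'9'; other characters are skipped.
final class DigitTrie {
    private static final class Node {
        final Node[] next = new Node[10];
        boolean terminal; // true if a stored line ends at this node
    }

    private final Node root = new Node();

    void add(String line) {
        Node node = root;
        for (int i = 0; i < line.length(); i++) {
            int d = line.charAt(i) - '0';
            if (d < 0 || d > 9) continue; // ignore separators
            if (node.next[d] == null) node.next[d] = new Node();
            node = node.next[d];
        }
        node.terminal = true;
    }

    boolean contains(String line) {
        Node node = root;
        for (int i = 0; i < line.length(); i++) {
            int d = line.charAt(i) - '0';
            if (d < 0 || d > 9) continue; // ignore separators
            node = node.next[d];
            if (node == null) return false;
        }
        return node.terminal;
    }
}
```

Each line of the old file goes through add, and each line of the new file for which contains returns false is a newly added line.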
A further memory optimization can take advantage that there are only 10 digits, so 4 bits is enough to store a digit (instead of 2 bytes per character in Java). You may need to adapt the trie data structure from one of the following links:
Trie data structures - Java
http://algs4.cs.princeton.edu/52trie/TrieST.java.html
A String object holding 11 characters (up to 12, in fact) will have a size of 64 bytes (on 64-bit Java with compressed oops). The only structure that can hold so many elements and still be of a reasonable size is an array:
100,000,000 * (64 bytes per String object + 4 bytes per reference) = 6,800,000,000 bytes ~ 6.3Gb
So you can immediately forget about Maps, Sets, etc., as they introduce too much memory overhead. But an array is actually all you need. My approach would be:
Load the "old" data into an array, sort it (this should be fast enough)
Create a back-up array of primitive booleans with same size as the loaded array (you can use the BitSet here as well)
Read line by line from the new data file. Use binary search to check whether the passport data exists in the old data array. If the item exists, mark its index in the boolean array/bitset as true (you get the index back from the binary search). If the item does not exist, just save it somewhere (an ArrayList will do).
When all lines are processed, remove from the old array all the items that have false in the boolean array/bitset (check by index, of course). Finally, add to the array all the new data you saved.
Optionally sort the array again and save to disk, so next time you load it you can skip the initial sorting.
This should be fast enough imo. Initial sort is O(n log(n)), while the binary search is O(log(n)) thus you should end up with (excluding final removal + adding which can be max 2n):
n log(n) (sort) + n log(n) (binary check for n elements) = 2 n log(n)
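The sort, binary search and bitset steps described above could be sketched roughly as follows. The class name SortedArrayDiff and its fields are hypothetical, and the sketch relies on each row being unique, as stated in the question.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.BitSet;
import java.util.List;

// Hypothetical result holder plus the diff routine.
final class SortedArrayDiff {
    final List<String> added = new ArrayList<>();   // lines only in the new data
    final List<String> removed = new ArrayList<>(); // lines only in the old data

    // Sorts oldData in place, then looks up every new line with binary search.
    static SortedArrayDiff diff(String[] oldData, Iterable<String> newLines) {
        SortedArrayDiff result = new SortedArrayDiff();
        Arrays.sort(oldData);
        BitSet seen = new BitSet(oldData.length);
        for (String line : newLines) {
            int index = Arrays.binarySearch(oldData, line);
            if (index >= 0) {
                seen.set(index);        // still present in the new file
            } else {
                result.added.add(line); // not present in the old file
            }
        }
        // Every old entry that was never matched has disappeared from the new file.
        for (int i = seen.nextClearBit(0); i < oldData.length; i = seen.nextClearBit(i + 1)) {
            result.removed.add(oldData[i]);
        }
        return result;
    }
}
```

In a real run the new lines would come from a BufferedReader rather than an in-memory collection, so only the old array and the (small) list of additions stay in memory.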
There would be other optimizations possible if you would explain more on the structure of that String you have (if there is some pattern or not).
The main problem is the repeated resizing of the ArrayList that happens inside readAllLines(). A better choice for inserting the data is a LinkedList:
try (BufferedReader reader = Files.newBufferedReader(path, StandardCharsets.UTF_8)) {
List<String> result = new LinkedList<>();
for (;;) {
String line = reader.readLine();
if (line == null)
break;
result.add(line);
}
return result;
}
I read a dictionary that might be 100MB or so in size (sometimes it gets bigger, up to a maximum of 500MB). It is a simple dictionary of two columns: the first column holds words and the second a float value. I read the dictionary file in this way:
BufferedReader br = new BufferedReader(new FileReader(file));
String line;
while((line = br.readLine()) != null) {
String[] cols = line.split("\t");
setIt(cols[0], cols[1]);
}
and for the setIt function:
public void setIt(String term, String value) {
all.put(term, new Double(value));
}
When I have a big file, it takes a long time to load it and it often goes out of memory. Even with a reasonable size file (100MB) it does need a 4GB memory in Java to be run.
Any clue how to improve it while not changing the structure of the whole package?
EDIT: I'm using a 50MB file with -Xmx1g and I still get the error.
UPDATE: There were some repeated iterations over the file, which I fixed, and the memory problem is now partially solved. I have yet to try the properties file and the other solutions and will report on them.
You are allocating a new String for every line. There is some overhead associated with a String. See Here for a calculation. This article also addresses the subject of object memory use in java.
There is a stack overflow question on the subject of more memory efficient replacements for strings here.
Is there something you can do to avoid all those allocations? For example, are there a limited number of strings that you could represent as an integer in your data structure, and then use a smaller lookup table to translate?
You can do a lot of things to reduce memory usage. for example :
1- replace String[] cols = line.split("\t"); with :
static final Pattern PATTERN = Pattern.compile("\t");
//...
String[] cols = PATTERN.split(line);
2- use .properties file to store your dictionary and simply load it this way :
Properties properties = new Properties();
//...
try (FileInputStream fileInputStream = new FileInputStream("D:/dictionary.properties")) {
properties.load(fileInputStream);
}
Map<String, Double> map = new HashMap<>();
Enumeration<?> enumeration = properties.propertyNames();
while (enumeration.hasMoreElements()){
String key = (String) enumeration.nextElement();
map.put(key, new Double(properties.getProperty(key)));
}
//...
dictionary.properties :
A = 1
B = 2
C = 3
//...
3- use StringTokenizer :
StringTokenizer tokenizer = new StringTokenizer(line, "\t");
setIt(tokenizer.nextToken(), tokenizer.nextToken());
Well, my solution deviates a little from your code...
Use Lucene, or more specifically the Lucene Dictionary, or even more specifically the Lucene Spell Checker, depending on what you want.
Lucene handles any amount of data with efficient memory usage.
Your problem is that you are storing the whole dictionary in memory. Lucene stores it in a file with hashing and then fetches search results from the file at runtime, but efficiently. This saves a lot of memory. You can customize the search depending on your needs.
Small Demo of Lucene
A few possible causes for this problem:
1). The String array cols is using up too much memory.
2). The String line might also be using too much memory, though that is unlikely.
3). While Java is opening and reading the file it is also using memory, so that is a possibility too.
4). Your map put will also take up a small amount of memory.
It might also be all of these things combined, so try commenting some lines out and see whether it works then.
The most likely cause is that all these things added together are eating your memory, so a 10-megabyte file could end up taking 50 megabytes. Also make sure to .close() all input streams, and try to free memory by splitting up your methods so that variables can be garbage collected sooner.
As for doing this without changing the package structure or the Java heap size arguments, I'm not sure it will be very easy, if it is possible at all.
Hope this helps.
I have two large CSV files which contain data that is required for users of a web application to validate some info. I defined an ArrayList< String[] > and intended to keep the contents of both files in memory so I wouldn't have to read them each time a user logged in and used the application.
I'm getting a java.lang.OutOfMemoryError: Java heap space, though, when initializing the application and trying to read the second file. (It finishes reading the first file just fine but hangs when reading the second file and after a while I get that exception)
The code for reading the files is pretty straight forward:
ArrayList<String[]> tokenizedLines = new ArrayList<String[]>();
public void parseTokensFile() throws Exception {
BufferedReader bRead = null;
FileReader fRead = null;
try {
fRead = new FileReader(this.tokensFile);
bRead = new BufferedReader(fRead);
String line;
while ((line = bRead.readLine()) != null) {
tokenizedLines.add(StringUtils.split(line, fieldSeparator));
}
} catch (Exception e) {
throw new Exception("Error parsing file.");
} finally {
bRead.close();
fRead.close();
}
}
I read Java's split function could use up a lot of memory when reading large amounts of data since the substring function makes a reference to the original string, so a substring of some String will use up the same amount of memory as the original, even though we only want a few chars, so I made a simple split function to try avoiding this:
public String[] split(String inputString, String separator) {
ArrayList<String> storage = new ArrayList<String>();
String remainder = new String(inputString);
int separatorLength = separator.length();
while (remainder.length() > 0) {
int nextOccurance = remainder.indexOf(separator);
if (nextOccurance != -1) {
storage.add(new String(remainder.substring(0, nextOccurance)));
remainder = new String(remainder.substring(nextOccurance + separatorLength));
} else {
break;
}
}
storage.add(remainder);
String[] tokenizedFields = storage.toArray(new String[storage.size()]);
storage = null;
return tokenizedFields;
}
This gives me the same error though, so I'm wondering if it's not a memory leak but simply that I can't have structures with so many objects in memory. One file is about 600'000 lines long, with 5 fields per line, and the other is around 900'000 lines long with about the same amount of fields per line.
The full stacktrace is:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at xxx.xxx.xxx.StringUtils.split(StringUtils.java:16)
at xxx.xxx.xxx.GFTokensFile.parseTokensFile(GFTokensFile.java:36)
So, after the long post (sorry :P), is this a restriction of the amount of memory assigned to my JVM or am I missing something obvious and wasting resources somewhere?
Your JVM won't get more than 2GB on a 32-bit operating system with 4GB of RAM. That's one upper limit.
The second is the max heap size you specify when you start the JVM. Look at that -Xmx parameter.
The third is the fact of life that you cannot fit X units of anything into a Y sized container where X > Y. You know the size of your files. Try parsing each one individually and seeing what kind of heap they're consuming.
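A crude way to see how much heap a parsed file consumes is to compare the used heap before and after loading. The sketch below uses only the standard Runtime API; the class name HeapUsage and the synthetic rows are made up, and because System.gc() is only a hint, the numbers are approximate.

```java
// Hypothetical helper for a rough before/after heap measurement.
final class HeapUsage {
    // Approximate number of heap bytes currently in use.
    static long usedBytes() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    public static void main(String[] args) {
        System.gc(); // only a hint to the JVM
        long before = usedBytes();
        // Synthetic stand-in for parsed CSV rows with 5 fields each.
        String[][] rows = new String[100_000][];
        for (int i = 0; i < rows.length; i++) {
            rows[i] = new String[] {"a" + i, "b" + i, "c" + i, "d" + i, "e" + i};
        }
        System.gc();
        long after = usedBytes();
        System.out.println("approx. bytes retained: " + (after - before));
        System.out.println("rows kept alive: " + rows.length);
    }
}
```

Scaling the result from 100,000 rows up to the real line counts gives a first estimate of the heap the two files would need.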
I'd recommend that you download Visual VM, install all the available plugins, and have it monitor your application while it's running. You'll be able to see the entire heap, perm gen space, GC collection, what objects are taking up the most memory, etc.
Getting data is invaluable for all problems, but especially ones like this. Without it, you're just guessing.
I cannot see a storage leak in the original version of the program.
The scenarios where split and similar methods can leak significant storage are rather limited:
You have to NOT be retaining a reference to the original string that you split.
You need to be retaining references to a subset of the strings produced by the string splitting.
In older JDKs (before Java 7 update 6), when String.substring() is called it creates a new String object that shares the original String's backing array. If the original String reference is then garbage collected, the substring is left holding onto an array of characters that includes characters that are not "in" the substring. This can be a storage leak, depending on how long the substring is kept.
In your example, you are keeping strings that contain all of the characters apart from the field separator character. There is a good chance that this actually saves space compared to the space used if each substring were an independent String. Certainly, it is no surprise that your version of split doesn't solve the problem.
I think you need to either increase the heap size, or change your application so that it doesn't need to keep all of the data in memory at the same time.
Try improving your code or leave data processing to a database.
The memory usage is larger than your file sizes because the code makes redundant copies of the processed data: at any moment there is data still to be processed, data already processed, and some partial data.
String is immutable (see here), so there is no need to use new String(...) to store the result; split already makes that copy.
If you can, delegate the whole data storage and searching to a database. CSV files are easily imported/exported to databases and they do all the hard work.
While I wouldn't recommend actual string interning for what you are doing, how about using the idea behind that technique? You could use a HashSet or HashMap to make sure you only use a single String instance whenever your data contains the same sequence of characters. I mean, there must be some kind of overlap in the data, right?
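A minimal sketch of that idea, assuming a plain HashMap used as a pool of canonical instances; the class name StringPool and its methods are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical pool that hands out one shared instance per distinct value.
final class StringPool {
    private final Map<String, String> pool = new HashMap<>();

    // Returns the first instance seen for this character sequence.
    String canonical(String value) {
        String existing = pool.putIfAbsent(value, value);
        return existing != null ? existing : value;
    }

    int size() {
        return pool.size();
    }
}
```

Passing every field through canonical before storing it means repeated values share one String object. This only pays off when the data really contains many repeats, because the pool itself also costs memory.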
On the other hand, what you might be seeing here could be a bad case of heap fragmentation. I'm not sure how the JVM handles these cases, but in the Microsoft CLR larger objects (especially arrays) will be allocated on a separate heap. Growth strategies, such as those of the ArrayList will create a larger array, then copy over the content of the previous array before releasing the reference to it. The Large Object Heap (LOH) isn't compacted in the CLR, so this growth strategy will leave huge areas of free memory that the ArrayList can no longer use.
I don't know how much of that applies to the Java VM, but you could try building the list using LinkedList first, then dump the list content into an ArrayList or directly into an array. That way the large array of lines would be created only once, without causing any fragmentation.
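A rough sketch of that two-step approach. LineCollector and readLines are hypothetical names, and whether this helps at all on a given JVM is an assumption you would need to measure; note that both lists are alive at the same time during the copy.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

public class LineCollector {
    /** Reads all lines into a LinkedList first, then copies them into an ArrayList once. */
    public static List<String> readLines(Reader source) throws IOException {
        List<String> staging = new LinkedList<>();
        try (BufferedReader br = new BufferedReader(source)) {
            String line;
            while ((line = br.readLine()) != null) {
                staging.add(line); // no backing array to regrow while reading
            }
        }
        // The copy constructor sizes the backing array from the collection,
        // so the large array is allocated a single time.
        return new ArrayList<>(staging);
    }

    public static void main(String[] args) throws IOException {
        List<String> lines = readLines(new StringReader("first\nsecond\nthird"));
        System.out.println(lines.size() + " lines read");
    }
}
```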
Be sure that the total length of both files is lower than your heap size. You can set the max heap size using the JVM option -Xmx.
Also, if you have that much content, maybe you shouldn't load it entirely into memory. I once had a similar problem and fixed it with an index file that stores the offset of each piece of information in the large file. Then I just had to read one line at the right offset.
There are also some strange things in your split method.
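The offset idea can be sketched as follows. This is a simplification under stated assumptions: the class LineIndex is hypothetical, the index is kept in memory rather than in a separate index file, and the text is assumed to be single-byte (ASCII or Latin-1), because RandomAccessFile.readLine does not decode multi-byte encodings.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.List;

public class LineIndex {
    private final String path;
    private final List<Long> offsets = new ArrayList<>(); // byte offset of each line start

    /** Scans the file once and records where every line begins. */
    public LineIndex(String path) throws IOException {
        this.path = path;
        try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
            long pos = raf.getFilePointer();
            while (raf.readLine() != null) {
                offsets.add(pos);
                pos = raf.getFilePointer();
            }
        }
    }

    public int lineCount() {
        return offsets.size();
    }

    /** Reads only line n (0-based) by seeking straight to its offset. */
    public String line(int n) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
            raf.seek(offsets.get(n));
            return raf.readLine();
        }
    }

    public static void main(String[] args) throws IOException {
        LineIndex index = new LineIndex(args[0]); // path to a large text file
        System.out.println("lines: " + index.lineCount());
        if (index.lineCount() > 0) {
            System.out.println("first line: " + index.line(0));
        }
    }
}
```

Only the offsets stay in memory (one number per line), so the lines themselves are read on demand. The initial scan is unbuffered and slow on big files; a real implementation would buffer it or persist the offsets.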
String remainder = new String(inputString);
You don't need to preserve inputString by making a copy: Strings are immutable, and reassigning the variable only affects the scope of the split method.
I am trying to read from a txt file (book) and then add every line of it to a LinkedList. However, when I run the code, I get an OutOfMemoryError at l.add(line);. Could you tell me what I am doing wrong with this code? Or is there a better way to store the String values instead of LinkedList?
Thanks a lot!
public Book (String bookname) throws java.io.IOException{
f = new FileReader(bookname);
b = new BufferedReader(f);
l = new LinkedList<String>();
String line = b.readLine();
while (line != null) {
l.add(line);
}
b.close();
}
As others point out, you have created an infinite, memory-consuming loop. A common idiom for reading from a BufferedReader is:
String line;
while ( ( line = b.readLine() ) != null) {
l.add(line);
}
I guess it is possible that the content of the book is just too large to fit into memory all at once anyway. You can increase the memory available to the JVM by using the -Xmx argument, e.g.:
java -Xmx1G MyClass
On older JVMs the default value for this was 64 MB, which isn't much these days; newer JVMs choose a default based on the machine's physical memory.
You are adding the same line over and over, until memory runs out:
String line = b.readLine();
while (line != null) {
l.add(line);
}
See? The line-variable is read outside the loop, and never changes within the loop.
Probably you should replace
while (line != null) {
l.add(line);
}
with
while ((line = b.readLine()) != null) {
l.add(line);
}
The while loop never exits, because the variable line never becomes null. Try this:
String line = "";
while ((line = b.readLine())!= null)
{
l.add(line);
}
b.close();
Quite simply, the memory required to store the strings (and everything else in your program) exceeded the total free memory available on the heap.
Other lists will have slightly different amounts of overhead, but realistically the memory requirements of the structure itself are likely to be insignificant compared to the data it contains. In other words, switching to a different list implementation might let you read a few more lines before falling over, but it won't solve the problem.
If you haven't increased the heap space of the java application, it might be running with fairly low defaults. In which case, you should consider providing the following command-line argument to your invocation of java:
-Xmx512m
(where the 512m implies 512 megabytes of heap space; you could use e.g. -Xmx2g or whatever else you feel is appropriate.)
On the other hand, if you're already running with a large heap (much larger than the total size of the Strings you want to hold in memory), this could point to a memory problem somewhere else. How many characters are there in the book? It will take at least twice that many bytes to store all of the lines, and probably 20% or so more than that to account for overhead. If your calculations indicate that your current heap space should be able to hold all of this data, there might be some trouble elsewhere. Otherwise, you now know the minimum you'd need to increase your heap to.
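That back-of-the-envelope calculation can be written down as a small check. This is only a sketch: HeapEstimate and estimateBytes are hypothetical names, the 2 bytes per character and 20% overhead are the rough figures from the paragraph above, and the file size in bytes is used as a stand-in for the character count (reasonable for single-byte encodings only). Java 9 and later can store Latin-1-only strings more compactly, so the estimate is on the high side there.

```java
import java.io.File;

public class HeapEstimate {
    /** Rough estimate: 2 bytes per char, plus about 20% for object overhead. */
    static long estimateBytes(long charCount) {
        return charCount * 2 * 12 / 10; // integer arithmetic: x * 2 * 1.2
    }

    public static void main(String[] args) {
        File book = new File(args[0]); // path to the text file
        long needed = estimateBytes(book.length()); // assumes 1 byte per char on disk
        long maxHeap = Runtime.getRuntime().maxMemory(); // reflects -Xmx if set
        System.out.println("estimated need: " + needed + " bytes");
        System.out.println("max heap:       " + maxHeap + " bytes");
        if (needed > maxHeap) {
            System.out.println("The file probably will not fit; raise -Xmx or stream it.");
        }
    }
}
```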
(As an aside, trying to process large amounts of input as a single batch can often run into memory issues - what if you want to process an 8GB text file? Often it's better to process smaller chunks sequentially, in some sort of stream. For example, if you wanted to uppercase every character and write it back out to a different file, you could do this a line at a time rather than reading the whole book into memory first.)
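The uppercase example from that aside might look like this. UppercaseStream and copyUppercased are names invented for the sketch; the point is only that one line is held in memory at a time, however large the input is.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.Reader;
import java.io.Writer;
import java.util.Locale;

public class UppercaseStream {
    /** Copies in to out one line at a time, uppercasing each line. */
    public static void copyUppercased(Reader in, Writer out) throws IOException {
        BufferedReader br = new BufferedReader(in);
        BufferedWriter bw = new BufferedWriter(out);
        String line;
        while ((line = br.readLine()) != null) {
            bw.write(line.toUpperCase(Locale.ROOT));
            bw.newLine(); // writes the platform line separator
        }
        bw.flush(); // push buffered output through; the caller closes the streams
    }

    public static void main(String[] args) throws IOException {
        // args[0] is the input file, args[1] the output file.
        try (Reader in = new FileReader(args[0]);
             Writer out = new FileWriter(args[1])) {
            copyUppercased(in, out);
        }
    }
}
```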
I agree with mjg123. Also, be careful with the OutOfMemoryError. Please have a look at this blog for more details on how to handle such situations.