Loading and processing very large files with Java

I'm trying to load a CSV file with a huge number of lines (>5 million), but it slows down massively when processing them all into an ArrayList of each value.
I've tried a few different variations of reading and removing from the input list I loaded from the file, but it still ends up running out of heap space, even when I allocate 14GB to the process, while the file is only 2GB.
I know I need to be removing values so that I don't end up with duplicate references in memory (an ArrayList of lines plus an ArrayList of the individual comma-separated values), but I have no idea how to do something like that.
Edit: For reference, in this particular situation, data should end up containing 16 * 5 million values.
If there's a more elegant solution, I'm all for it.
The intention when loading this file is to process it as a database, with the appropriate methods like select and select where, all handled by a Sheet class. It worked just fine with my smaller sample file of 36k lines, but I guess it doesn't scale very well.
Current code:
//Load method to load it from file
private static CSV loadCSV(String filename, boolean absolute)
{
    String fullname = "";
    if (!absolute)
    {
        fullname = baseDirectory + filename;
        if (!Load.exists(fullname, false))
            return null;
    }
    else if (absolute)
    {
        fullname = filename;
        if (!Load.exists(fullname, false))
            return null;
    }
    ArrayList<String> output = new ArrayList<String>();
    AtomicInteger atomicInteger = new AtomicInteger(0);
    try (Stream<String> stream = Files.lines(Paths.get(fullname)))
    {
        stream.forEach(t -> {
            output.add(t);
            atomicInteger.getAndIncrement();
            if (atomicInteger.get() % 10000 == 0)
            {
                Log.log("Lines done " + output.size());
            }
        });
        CSV c = new CSV(output);
        return c;
    }
    catch (IOException e)
    {
        Log.log("Error reading file " + fullname, 3, "FileIO");
        e.printStackTrace();
    }
    return null;
}
//Process method inside CSV class
public CSV(List<String> output)
{
    Log.log("Inside csv " + output.size());
    ListIterator<String> iterator = output.listIterator();
    while (iterator.hasNext())
    {
        ArrayList<String> d = new ArrayList<String>(Arrays.asList(iterator.next().split(splitter, -1)));
        data.add(d);
        iterator.remove();
    }
}

You need to use a database that provides the required functionality for your task (select, group).
Any database can efficiently read and aggregate 5 million rows.
Don't try to use operations on an ArrayList; that only works well on small datasets.

I think some key concepts are missing here:
You said the file size is 2GB. That does not mean that when you load that file's data into an ArrayList, the size in memory would also be 2GB. Why? Usually files store data using the UTF-8 character encoding, whereas the JVM internally stores String values using UTF-16. So, assuming your file contains only ASCII characters, each character occupies 1 byte in the filesystem but 2 bytes in memory. Assuming (for the sake of simplicity) that all String values are unique, there will also be space required to store the String references, which are 32 bits each (assuming a 64-bit system with compressed oops). How much is your heap (excluding other memory areas)? How much is your eden space and old space? I'll come back to this again shortly.
In your code, you don't specify the ArrayList size. This is a blunder in this case. Why? The JVM creates a small ArrayList. After some time the JVM sees that this guy keeps pumping in data. Let's create a bigger ArrayList and copy the data of the old ArrayList into the new list. This event has some deeper implications when you are dealing with such a huge volume of data: firstly, note that both the old and new arrays (with millions of entries) are in memory simultaneously, occupying space; secondly, data is unnecessarily copied from one array to another - not once or twice but repeatedly, every time the array runs out of space. What happens to the old array? Well, it's discarded and needs to be garbage collected. So these repeated array copies and garbage collections slow down the process. The CPU is really working hard here. What happens when your data no longer fits into the young generation (which is smaller than the heap)? Maybe you need to observe the behaviour using something like JVisualVM.
All in all, what I mean to say is that there are a good number of reasons why a 2GB file fills up your much larger heap and why your process performance is poor.
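On the resizing point: a minimal sketch of presizing the list (the expected line count here is an assumption based on the question's >5 million rows):

```java
import java.util.ArrayList;

public class PresizeDemo {
    // Growing an ArrayList from its default capacity to 5 million entries
    // triggers roughly 30 grow-and-copy cycles (capacity grows ~1.5x each
    // time), with the old and new backing arrays briefly live together.
    // Presizing allocates the backing array exactly once.
    static ArrayList<String> presized(int expectedLines) {
        return new ArrayList<>(expectedLines);
    }
}
```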

I would have a method that takes a line read from the file as a parameter, splits it into a list of strings, and returns that list. I would then add that list to the CSV object in the file-reading loop. That would mean only one large collection instead of two, and the read lines could be freed from memory more quickly.
Something like this
CSV csv = new CSV();
try (Stream<String> stream = Files.lines(Paths.get(fullname))) {
    stream.forEach(t -> {
        List<String> splittedString = splitFileRow(t);
        csv.add(splittedString);
    });
} catch (IOException e) {
    e.printStackTrace();
}

Trying to solve this problem in pure Java is overwhelming. I suggest using a processing engine like Apache Spark, which can process the file in a distributed way by increasing the level of parallelism.
Apache Spark has specific APIs to load CSV file:
spark.read.format("csv").option("header", "true").load("../Downloads/*.csv")
You can transform it into an RDD or a DataFrame and perform operations on it.
You can find more online, or here

Related

Merge 2 large csv files using inner join

I need advice from someone who knows Java and its memory issues very well. I have large CSV files (something like 500MB each) and I need to merge these files into one using only 64MB of Xmx. I've tried to do it different ways, but nothing works - I always get a memory exception. What should I do to make it work properly?
The task is:
Develop a simple implementation that joins two input tables in a reasonably efficient way and can store both tables in RAM if needed.
My code works, but it takes a lot of memory, so it can't fit in 64MB.
public class ImprovedInnerJoin {
    public static void main(String[] args) throws IOException {
        RandomAccessFile firstFile = new RandomAccessFile("input_A.csv", "r");
        FileChannel firstChannel = firstFile.getChannel();
        RandomAccessFile secondFile = new RandomAccessFile("input_B.csv", "r");
        FileChannel secondChannel = secondFile.getChannel();
        RandomAccessFile resultFile = new RandomAccessFile("result2.csv", "rw");
        FileChannel resultChannel = resultFile.getChannel().position(0);
        ByteBuffer resultBuffer = ByteBuffer.allocate(40);
        ByteBuffer firstBuffer = ByteBuffer.allocate(25);
        ByteBuffer secondBuffer = ByteBuffer.allocate(25);
        while (secondChannel.position() != secondChannel.size()) {
            Map<String, List<String>> table2Part = new HashMap<>();
            for (int i = 0; i < secondChannel.size(); ++i) {
                if (secondChannel.read(secondBuffer) == -1)
                    break;
                secondBuffer.rewind();
                String[] table2Tuple = (new String(secondBuffer.array(), Charset.defaultCharset())).split(",");
                if (!table2Part.containsKey(table2Tuple[0]))
                    table2Part.put(table2Tuple[0], new ArrayList<>());
                table2Part.get(table2Tuple[0]).add(table2Tuple[1]);
                secondBuffer.clear();
            }
            Set<String> table2Keys = table2Part.keySet();
            while (firstChannel.read(firstBuffer) != -1) {
                firstBuffer.rewind();
                String[] table1Tuple = (new String(firstBuffer.array(), Charset.defaultCharset())).split(",");
                for (String table2Key : table2Keys) {
                    if (table1Tuple[0].equals(table2Key)) {
                        for (String value : table2Part.get(table2Key)) {
                            String result = table1Tuple[0] + "," + table1Tuple[1].substring(0, 14) + "," + value; // 0,14 or result buffer will be overflown
                            resultBuffer.put(result.getBytes());
                            resultBuffer.rewind();
                            while (resultBuffer.hasRemaining()) {
                                resultChannel.write(resultBuffer);
                            }
                            resultBuffer.clear();
                        }
                    }
                }
                firstBuffer.clear();
            }
            firstChannel.position(0);
            table2Part.clear();
        }
        firstChannel.close();
        secondChannel.close();
        resultChannel.close();
        System.out.println("Operation completed.");
    }
}
A very easy to implement version of an external join is the external hash join.
It is much easier to implement than an external merge sort join and only has one drawback (more on that later).
How does it work?
Very similar to a hashtable.
Choose a number n, which signifies how many files ("buckets") you're distributing your data into.
Then do the following:
Set up n file writers
For each of your files that you want to join and for each line:
take the hashcode of the key you want to join on
compute the modulo of the hashcode and n, that will give you k
append your csv line to the kth file writer
Flush/Close all n writers.
Now you have n, hopefully smaller, files with the guarantee that the same key will always be in the same file. Now you can run your standard HashMap/HashMultiSet based join on each of these files separately.
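A minimal sketch of the partitioning pass described above (the bucket file layout and the assumption that the join key is the first comma-separated field are mine):

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class HashPartitioner {
    // Distribute each line of a CSV into one of n bucket files, chosen by
    // the hash of its join key. Equal keys always land in the same bucket,
    // so each bucket can later be joined in memory on its own.
    static void partition(Path input, Path outDir, int n) throws IOException {
        List<BufferedWriter> writers = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            writers.add(Files.newBufferedWriter(outDir.resolve("bucket-" + i)));
        }
        try (BufferedReader reader = Files.newBufferedReader(input)) {
            String line;
            while ((line = reader.readLine()) != null) {
                int comma = line.indexOf(',');
                String key = (comma < 0) ? line : line.substring(0, comma);
                BufferedWriter w = writers.get(Math.floorMod(key.hashCode(), n));
                w.write(line);
                w.newLine();
            }
        } finally {
            for (BufferedWriter w : writers) {
                w.close();
            }
        }
    }
}
```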
Limitations
Why did I mention hopefully smaller files? Well, it depends on the distribution of the keys and their hashcodes. Consider the worst case: all of your records have the exact same key - you'd have only one bucket file and you'd gain nothing from partitioning.
Similarly, with skewed distributions, sometimes a few of your bucket files will be too big to fit into your RAM.
Usually there are three ways out of this dilemma:
Run the algorithm again with a bigger n, so you have more buckets to distribute to
Take only the buckets that are too big and do another hash partitioning pass only on those files (so each file goes into n newly created buckets again)
Fallback to an external merge sort on the big partition files.
Sometimes all three are used in different combinations, which is called dynamic partitioning.
If central memory is a constraint for your application but you can access a persistent file, I would, as blahfunk suggested, create a temporary SQLite file in your tmp folder, read every file in chunks, and merge them with a simple join. You could create a temporary SQLite DB with libraries such as Hibernate; take a look at what I found in this StackOverflow question: How to create database in Hibernate at runtime?
If you cannot perform such a task, your remaining option is to consume more CPU: load just the first row of the first file, search for a row with the same index in the second file, buffer the result and flush it as late as possible to the output file, repeating this for every row of the first file.
Maybe you can stream the first file and turn each line into a hashcode and save all those hashcodes in memory. Then stream the second file and make a hashcode for each line as it comes in. If the hashcode is in the first file, i.e., in memory, then don't write the line, else write the line. After that, append the first file in its entirety into the result file.
This would effectively create an index to compare your updates against.
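A sketch of that hash-index idea (class and method names are mine; note the caveat in the comment about hash collisions):

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashSet;
import java.util.Set;

public class HashcodeDiff {
    // Keep only the 32-bit hash of each old line in memory (boxed in the
    // Set, but still far smaller than the full Strings), then stream the
    // new file and emit lines whose hash was never seen.
    // Caveat: two different lines can share a hash code, so a genuinely
    // new line may occasionally be dropped; use full-line comparison if
    // that is unacceptable.
    static void writeNewLines(Path oldFile, Path newFile, Path out) throws IOException {
        Set<Integer> seen = new HashSet<>();
        try (BufferedReader r = Files.newBufferedReader(oldFile)) {
            String line;
            while ((line = r.readLine()) != null) {
                seen.add(line.hashCode());
            }
        }
        try (BufferedReader r = Files.newBufferedReader(newFile);
             BufferedWriter w = Files.newBufferedWriter(out)) {
            String line;
            while ((line = r.readLine()) != null) {
                if (!seen.contains(line.hashCode())) {
                    w.write(line);
                    w.newLine();
                }
            }
        }
    }
}
```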

Improving speed and memory consumption when handling ArrayList with 100 million elements

I work with text files with short strings in them (10 digits). The size of each file is approx 1.5GB, so the number of rows reaches 100 million.
Every day I get another file and need to extract the new elements (tens of thousands a day).
What's the best approach to solve my problem?
I tried to load the data into an ArrayList - it takes around 20 seconds per file, but the subtraction of the arrays takes forever.
I use this code:
dataNew.removeAll(dataOld);
Tried to load data in HashSets - creation of HashSets is endless.
The same with LinkedHashset.
Tried to load into ArrayLists and to sort only one of them
Collections.sort(dataNew);
but it didn't speed up the process of
dataNew.removeAll(dataOld);
Also memory consumption is rather high - sort() finishes only with heap of 15Gb (13Gb is not enough).
I've tried to use old good linux util diff and it finished the task in 76 minutes (while eating 8Gb of RAM).
So, my goal is to solve the problem in Java within 1 hour of processing time (or less, of course) and with consumption of 15Gb (or better 8-10Gb).
Any suggestions, please?
Maybe I need not alphabetic sorting of ArrayList, but something else?
UPDATE:
This is a country-wide list of invalid passports. It is published as a global list, so I need to extract delta by myself.
Data is unsorted and each row is unique. So I must compare 100M elements with 100M elements. A data line is, for example, "2404,107263". Converting it to an integer is not possible.
Interestingly, when I increased the maximum heap size to 16GB
java -Xms5G -Xmx16G -jar utils.jar
loading into a HashSet became fast (50 seconds for the first file), but the program gets killed by the system's out-of-memory killer, as it eats enormous amounts of RAM while loading the second file into a second HashSet or ArrayList.
My code is very simple:
List<String> setL = Files.readAllLines(Paths.get("filename"));
HashSet<String> dataNew = new HashSet<>(setL);
on second file the program gets
Killed
[1408341.392872] Out of memory: Kill process 20538 (java) score 489 or sacrifice child
[1408341.392874] Killed process 20531 (java) total-vm:20177160kB, anon-rss:16074268kB, file-rss:0kB
UPDATE2:
Thanks for all your ideas!
The final solution is: converting lines to Long + using the fastutil library (LongOpenHashSet).
RAM consumption became 3.6GB and processing time only 40 seconds!
An interesting observation: while starting Java with default settings made loading 100 million Strings into the JDK's native HashSet endless (I interrupted it after 1 hour), starting with -Xmx16G sped the process up to 1 minute. But memory consumption was ridiculous (around 20GB), while processing speed was rather fine - 2 minutes.
If someone is not limited by RAM, the native JDK HashSet is not so bad in terms of speed.
P.S. Maybe the task is not clearly explained, but I do not see any way to avoid loading at least one file entirely. So I doubt memory consumption can be lowered much further.
First of all, don't do Files.readAllLines(Paths.get("filename")) and then pass everything to a Set; that holds unnecessarily huge amounts of data. Try to hold as few lines as possible at all times.
Read the files line-by-line and process as you go. This immediately cuts your memory usage by a lot.
Set<String> oldData = new HashSet<>();
try (BufferedReader reader = Files.newBufferedReader(Paths.get("oldData"))) {
    for (String line = reader.readLine(); line != null; line = reader.readLine()) {
        // process your line, maybe add to the Set for the old data?
        oldData.add(line);
    }
}

Set<String> newData = new HashSet<>();
try (BufferedReader reader = Files.newBufferedReader(Paths.get("newData"))) {
    for (String line = reader.readLine(); line != null; line = reader.readLine()) {
        // Is it enough just to remove from old data so that you'll end up with only the difference between old and new?
        boolean oldRemoved = oldData.remove(line);
        if (!oldRemoved) {
            newData.add(line);
        }
    }
}
You'll end up with two sets containing only the data that is present in the old, or the new dataset, respectively.
Second of all, try to presize your containers if at all possible. Their size (usually) doubles when they reach their capacity, and that could potentially create a lot of overhead when dealing with big collections.
Also, if your data are numbers, you could just use a long and hold that instead of trying to hold instances of String. There are a lot of collection libraries that enable you to do this, e.g. Koloboke, HPPC, HPPC-RT, GS Collections, fastutil, Trove. Even their collections for Objects might serve you very well, as the standard HashSet has a lot of unnecessary object allocation.
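Along those lines, a dependency-free sketch of the number-packing idea (assuming, as in the example line "2404,107263", that both fields are non-negative and fit in 32 bits; a sorted long[] with binary search stands in for a primitive hash set such as fastutil's LongOpenHashSet, to keep the sketch free of external libraries):

```java
import java.util.Arrays;

public class PackedDiff {
    // Pack a line like "2404,107263" into a single long: the first field
    // goes in the high 32 bits, the second in the low 32 bits.
    static long pack(String line) {
        int comma = line.indexOf(',');
        long series = Long.parseLong(line.substring(0, comma));
        long number = Long.parseLong(line.substring(comma + 1));
        return (series << 32) | number;
    }

    // Return the candidate keys not present in oldKeys. Everything stays
    // primitive: 100M longs is ~800MB, versus several GB of String objects.
    static long[] newKeys(long[] oldKeys, long[] candidates) {
        long[] sorted = oldKeys.clone();
        Arrays.sort(sorted);
        return Arrays.stream(candidates)
                     .filter(k -> Arrays.binarySearch(sorted, k) < 0)
                     .toArray();
    }
}
```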
Please split the strings in two, and whichever part (str1 or str2) is repeated most, use intern() on it to avoid duplicating the same String in the heap. Here I used intern() on both parts just to show the sample, but don't use it unless they repeat often.
Set<MyObj> lineData = new HashSet<MyObj>();
String line = null;
BufferedReader bufferedReader = new BufferedReader(new FileReader(file.getAbsoluteFile()));
while ((line = bufferedReader.readLine()) != null) {
    String[] data = line.split(",");
    MyObj myObj = new MyObj();
    myObj.setStr1(data[0].intern());
    myObj.setStr2(data[1].intern());
    lineData.add(myObj);
}
bufferedReader.close();
public class MyObj {
    private String str1;
    private String str2;

    public String getStr1() {
        return str1;
    }

    public void setStr1(String str1) {
        this.str1 = str1;
    }

    public String getStr2() {
        return str2;
    }

    public void setStr2(String str2) {
        this.str2 = str2;
    }

    @Override
    public int hashCode() {
        final int prime = 31;
        int result = 1;
        result = prime * result + ((str1 == null) ? 0 : str1.hashCode());
        result = prime * result + ((str2 == null) ? 0 : str2.hashCode());
        return result;
    }

    @Override
    public boolean equals(Object obj) {
        if (this == obj)
            return true;
        if (obj == null)
            return false;
        if (getClass() != obj.getClass())
            return false;
        MyObj other = (MyObj) obj;
        if (str1 == null) {
            if (other.str1 != null)
                return false;
        } else if (!str1.equals(other.str1))
            return false;
        if (str2 == null) {
            if (other.str2 != null)
                return false;
        } else if (!str2.equals(other.str2))
            return false;
        return true;
    }
}
Use a database; to keep things simple, use a Java-embedded database (Derby, HSQL, H2, ...). With that much information, you can really benefit from standard DB caching, time-efficient storage, and querying. Your pseudo-code would be:
if first use,
    define new one-column table, setting column as primary-key
    iterate through input records, for each:
        insert record into table
otherwise
    open database with previous records
    iterate through input records, for each:
        lookup record in DB, update/report as required
Alternatively, you can do even less work if you use existing "table-diff" libraries, such as DiffKit - from their tutorial:
java -jar ../diffkit-app.jar -demoDB
Then configure a connection to this demo database within your favorite
JDBC enabled database browser
[...]
Your DB browser will show you the tables TEST10_LHS_TABLE and
TEST10_RHS_TABLE (amongst others) populated with the data values from
the corresponding CSV files.
That is: DiffKit does essentially what I proposed above, loading files into database tables (they use embedded H2) and then comparing these tables through DB queries.
They accept input as CSV files; but conversion from your textual input to their CSV can be done in a streaming fashion in less than 10 lines of code. And then you just need to call their jar to do the diff, and you would get the results as tables in their embedded DB.
I made a very simple spell checker; just checking whether a word was in the dictionary was too slow for whole documents. I created a map structure, and it works great.
Map<String, List<String>> dictionary;
For the key, I use the first 2 letters of the word. The list has all the words that start with the key. To speed it up a bit more you can sort the list, then use a binary search to check for existence. I'm not sure the optimum length of key, and if your key gets too long you could nest the maps. Eventually it becomes a tree. A trie structure is possibly the best actually.
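A sketch of that two-letter-prefix map with sorted buckets (class and method names are mine):

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PrefixDictionary {
    // Bucket words by their first two letters; each bucket is kept sorted
    // so a membership test is a small binary search instead of a scan of
    // the whole dictionary.
    private final Map<String, List<String>> buckets = new HashMap<>();

    void addAll(Collection<String> words) {
        for (String w : words) {
            buckets.computeIfAbsent(keyOf(w), k -> new ArrayList<>()).add(w);
        }
        for (List<String> bucket : buckets.values()) {
            Collections.sort(bucket);
        }
    }

    boolean contains(String word) {
        List<String> bucket = buckets.get(keyOf(word));
        return bucket != null && Collections.binarySearch(bucket, word) >= 0;
    }

    private static String keyOf(String w) {
        return w.length() < 2 ? w : w.substring(0, 2);
    }
}
```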
You can use a trie data structure for such cases: http://www.toptal.com/java/the-trie-a-neglected-data-structure
The algorithm would be as follows:
Read the old file line by line and store each line in the trie.
Read the new file line by line and test each line for whether it is in the trie: if it is not, then it is a newly added line.
A further memory optimization can take advantage of the fact that there are only 10 digits, so 4 bits are enough to store a digit (instead of 2 bytes per character in Java). You may need to adapt the trie data structure from one of the following links:
Trie data structures - Java
http://algs4.cs.princeton.edu/52trie/TrieST.java.html
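A minimal digit-trie sketch along the lines of the algorithm above (one child slot per decimal digit; the 4-bit packing mentioned is left out for clarity, and the class name is mine):

```java
public class DigitTrie {
    // Each node has at most 10 children, one per decimal digit, so a
    // 10-digit string costs ten small nodes on its path, and common
    // prefixes are shared across all entries.
    private final DigitTrie[] children = new DigitTrie[10];
    private boolean terminal;

    void insert(String digits) {
        DigitTrie node = this;
        for (int i = 0; i < digits.length(); i++) {
            int d = digits.charAt(i) - '0';
            if (node.children[d] == null) {
                node.children[d] = new DigitTrie();
            }
            node = node.children[d];
        }
        node.terminal = true;
    }

    boolean contains(String digits) {
        DigitTrie node = this;
        for (int i = 0; i < digits.length(); i++) {
            int d = digits.charAt(i) - '0';
            if (node.children[d] == null) {
                return false;
            }
            node = node.children[d];
        }
        return node.terminal;
    }
}
```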
A String object holding 11 characters (up to 12, in fact) will have a size of 64 bytes (on 64-bit Java with compressed oops). The only structure that can hold so many elements and be of a reasonable size is an array:
100,000,000 * (64b per String object + 4b per reference) = 6,800,000,000b ~ 6.3Gb
So you can immediately forget about Maps, Sets, etc., as they introduce too much memory overhead. An array is actually all you need. My approach would be:
Load the "old" data into an array, sort it (this should be fast enough)
Create a back-up array of primitive booleans with same size as the loaded array (you can use the BitSet here as well)
Read line by line from the new data file. Use binary search to check if the password data exists in the old data array. If the item exists, mark its index in the boolean array/bitset as true (you get the index back from the binary search). If the item does not exist, just save it somewhere (an array list can serve).
When all lines are processed, remove from the old array all the items that have false in the boolean array/bitset (checking by index, of course). Finally, add to the array all the new data you saved somewhere.
Optionally sort the array again and save it to disk, so next time you load it you can skip the initial sorting.
This should be fast enough, IMO. The initial sort is O(n log(n)), while the binary search is O(log(n)), thus you should end up with (excluding the final removal + adding, which can be at most 2n):
n log(n) (sort) + n log(n) (binary check for n elements) = 2 n log(n)
There would be other optimizations possible if you explained more about the structure of that String (whether there is some pattern or not).
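The four steps above can be sketched as follows (on a small String[] for illustration; class and method names are mine, and the real data would be the 100M-entry array):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.BitSet;
import java.util.List;

public class SortedArrayDiff {
    // Sort the old data, mark via binary search which old entries reappear
    // in the new data, collect genuinely new lines, then keep the marked
    // old entries plus the new ones.
    static List<String> merge(String[] oldData, Iterable<String> newLines) {
        Arrays.sort(oldData);                          // step 1
        BitSet seenAgain = new BitSet(oldData.length); // step 2
        List<String> added = new ArrayList<>();
        for (String line : newLines) {                 // step 3
            int idx = Arrays.binarySearch(oldData, line);
            if (idx >= 0) {
                seenAgain.set(idx);
            } else {
                added.add(line);
            }
        }
        List<String> result = new ArrayList<>();       // step 4
        for (int i = 0; i < oldData.length; i++) {
            if (seenAgain.get(i)) {
                result.add(oldData[i]);
            }
        }
        result.addAll(added);
        return result;
    }
}
```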
The main problem is the numerous resizings of the ArrayList that happen inside readAllLines(). A better choice is a LinkedList for inserting the data:
try (BufferedReader reader = Files.newBufferedReader(path, StandardCharsets.UTF_8)) {
    List<String> result = new LinkedList<>();
    for (;;) {
        String line = reader.readLine();
        if (line == null)
            break;
        result.add(line);
    }
    return result;
}

Too much memory while reading a dictionary file in Java

I read a dictionary that might be 100MB or so in size (sometimes it gets bigger, up to a max of 500MB). It is a simple dictionary of two columns, the first column words, the second column float values. I read the dictionary file this way:
BufferedReader br = new BufferedReader(new FileReader(file));
String line;
while ((line = br.readLine()) != null) {
    String[] cols = line.split("\t");
    setIt(cols[0], cols[1]);
}
and for the setIt function:
public void setIt(String term, String value) {
    all.put(term, new Double(value));
}
When I have a big file, it takes a long time to load and it often runs out of memory. Even with a reasonably sized file (100MB) it needs 4GB of memory in Java to run.
Any clue how to improve it without changing the structure of the whole package?
EDIT: I'm using a 50MB file with -Xmx1g and I still get the error.
UPDATE: There were some iterations over the file that I have now fixed, and the memory problem is partially solved. I have yet to try the properties and other solutions and will report on that.
You are allocating a new String for every line. There is some overhead associated with a String. See Here for a calculation. This article also addresses the subject of object memory use in java.
There is a stack overflow question on the subject of more memory efficient replacements for strings here.
Is there something you can do to avoid all those allocations? For example, are there a limited number of strings that you could represent as an integer in your data structure, and then use a smaller lookup table to translate?
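A sketch of that lookup-table idea: assign each distinct string a small integer id and store the ids instead of duplicate String instances (names are mine, and this assumes substantial overlap in the data):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class StringPool {
    // Map each distinct word to a small integer id; the main data
    // structure then stores ints, with only one shared String instance
    // per distinct word.
    private final Map<String, Integer> ids = new HashMap<>();
    private final List<String> words = new ArrayList<>();

    int idOf(String word) {
        return ids.computeIfAbsent(word, w -> {
            words.add(w);
            return words.size() - 1;
        });
    }

    String wordOf(int id) {
        return words.get(id);
    }
}
```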
You can do a lot of things to reduce memory usage. For example:
1- replace String[] cols = line.split("\t"); with :
static final Pattern PATTERN = Pattern.compile("\t");
//...
String[] cols = PATTERN.split(line);
2- use .properties file to store your dictionary and simply load it this way :
Properties properties = new Properties();
//...
try (FileInputStream fileInputStream = new FileInputStream("D:/dictionary.properties")) {
    properties.load(fileInputStream);
}
Map<String, Double> map = new HashMap<>();
Enumeration<?> enumeration = properties.propertyNames();
while (enumeration.hasMoreElements()) {
    String key = (String) enumeration.nextElement();
    map.put(key, new Double(properties.getProperty(key)));
}
//...
dictionary.properties :
A = 1
B = 2
C = 3
//...
3- use StringTokenizer :
StringTokenizer tokenizer = new StringTokenizer(line, "\t");
setIt(tokenizer.nextToken(), tokenizer.nextToken());
Well, my solution deviates a little bit from your code...
Use Lucene, or more specifically the Lucene dictionary, or even more specifically the Lucene spell checker, depending on what you want.
Lucene handles any amount of data with efficient memory usage.
Your problem is that you are storing the whole dictionary in memory... Lucene stores it in a file with hashing and then fetches search results from the file at runtime, efficiently. This saves a lot of memory. You can customize the search depending on your needs.
Small Demo of Lucene
A few causes of this problem would be:
1). The String array cols is using up too much memory.
2). The String line might also be using too much memory, though that's unlikely.
3). While Java is opening and reading the file it's also using memory, so that's also a possibility.
4). Your map put will also be taking up a small amount of memory.
It might also be all these things combined, so maybe try commenting some lines out and see if it works then.
The most likely cause is that all these things added up are eating your memory. So a 10-megabyte file could end up being 50 megabytes in memory. Also make sure to .close() all input streams, and try to free RAM by splitting up your methods so variables can be garbage collected.
As for doing this without changing the package structure or the Java heap size arguments, I'm not sure it will be very easy, if possible at all.
Hope this helps.

Is this leaking memory or am I just reaching the limit of objects I can keep in memory?

I have two large CSV files which contain data that is required for users of a web application to validate some info. I defined an ArrayList<String[]> and intended to keep the contents of both files in memory so I wouldn't have to read them each time a user logged in and used the application.
I'm getting a java.lang.OutOfMemoryError: Java heap space, though, when initializing the application and trying to read the second file. (It finishes reading the first file just fine but hangs when reading the second file and after a while I get that exception)
The code for reading the files is pretty straightforward:
ArrayList<String[]> tokenizedLines = new ArrayList<String[]>();

public void parseTokensFile() throws Exception {
    BufferedReader bRead = null;
    FileReader fRead = null;
    try {
        fRead = new FileReader(this.tokensFile);
        bRead = new BufferedReader(fRead);
        String line;
        while ((line = bRead.readLine()) != null) {
            tokenizedLines.add(StringUtils.split(line, fieldSeparator));
        }
    } catch (Exception e) {
        throw new Exception("Error parsing file.");
    } finally {
        bRead.close();
        fRead.close();
    }
}
I read that Java's split function can use up a lot of memory when reading large amounts of data, since the substring function keeps a reference to the original string: a substring of some String will use up the same amount of memory as the original, even though we only want a few chars. So I made a simple split function to try to avoid this:
public String[] split(String inputString, String separator) {
    ArrayList<String> storage = new ArrayList<String>();
    String remainder = new String(inputString);
    int separatorLength = separator.length();
    while (remainder.length() > 0) {
        int nextOccurance = remainder.indexOf(separator);
        if (nextOccurance != -1) {
            storage.add(new String(remainder.substring(0, nextOccurance)));
            remainder = new String(remainder.substring(nextOccurance + separatorLength));
        } else {
            break;
        }
    }
    storage.add(remainder);
    String[] tokenizedFields = storage.toArray(new String[storage.size()]);
    storage = null;
    return tokenizedFields;
}
This gives me the same error though, so I'm wondering if it's not a memory leak but simply that I can't have structures with so many objects in memory. One file is about 600'000 lines long, with 5 fields per line, and the other is around 900'000 lines long with about the same number of fields per line.
The full stacktrace is:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at xxx.xxx.xxx.StringUtils.split(StringUtils.java:16)
at xxx.xxx.xxx.GFTokensFile.parseTokensFile(GFTokensFile.java:36)
So, after the long post (sorry :P), is this a restriction of the amount of memory assigned to my JVM or am I missing something obvious and wasting resources somewhere?
Your JVM won't get more than 2GB on a 32-bit operating system with 4GB of RAM. That's one upper limit.
The second is the max heap size you specify when you start the JVM. Look at that -Xmx parameter.
The third is the fact of life that you cannot fit X units of anything into a Y sized container where X > Y. You know the size of your files. Try parsing each one individually and seeing what kind of heap they're consuming.
I'd recommend that you download Visual VM, install all the available plugins, and have it monitor your application while it's running. You'll be able to see the entire heap, perm gen space, GC collection, what objects are taking up the most memory, etc.
Getting data is invaluable for all problems, but especially ones like this. Without it, you're just guessing.
I cannot see a storage leak in the original version of the program.
The scenarios where split and similar methods can leak significant storage are rather limited:
You have to NOT be retaining a reference to the original string that you split.
You need to be retaining references to a subset of the strings produced by the string splitting.
What happens when String.substring() is called is that it creates a new String object that shares the original String's backing array. If the original String reference is then garbage collected, then the substring String is now holding onto an array of characters that includes characters that are not "in" the substring. This can be a storage leak, depending on how long the substring is kept.
In your example, you are keeping strings that contain all characters apart from the field separator character. There is a good chance that this is actually saving space ... compared to the space used if each substring were an independent String. Certainly, it is no surprise that your version of split doesn't solve the problem.
I think you need to either increase the heap size, or change your application so that it doesn't need to keep all of the data in memory at the same time.
Try improving your code or leave data processing to a database.
The memory usage is larger than your file sizes because the code makes redundant copies of the processed data: there is the data still to be processed, the processed data, and some partial data.
String is immutable (see here); there is no need to use new String(...) to store the result - split already makes that copy.
If you can, delegate the whole data storage and searching to a database. CSV files are easily imported/exported to databases and they do all the hard work.
While I wouldn't recommend actual string interning for what you are doing, how about using the idea behind that technique? You could use a HashSet or HashMap to make sure you only use a single String instance whenever your data contains the same sequence of characters. I mean, there must be some kind of overlap in the data, right?
On the other hand, what you might be seeing here could be a bad case of heap fragmentation. I'm not sure how the JVM handles these cases, but in the Microsoft CLR larger objects (especially arrays) will be allocated on a separate heap. Growth strategies, such as those of the ArrayList will create a larger array, then copy over the content of the previous array before releasing the reference to it. The Large Object Heap (LOH) isn't compacted in the CLR, so this growth strategy will leave huge areas of free memory that the ArrayList can no longer use.
I don't know how much of that applies to the Java VM, but you could try building the list using LinkedList first, then dump the list content into an ArrayList or directly into an array. That way the large array of lines would be created only once, without causing any fragmentation.
Be sure that the total size of the data you load is lower than your heap size. You can set the max heap size using the JVM option -Xmx.
Then, if you have so much content, maybe you shouldn't load it entirely into memory. I once had a similar problem and fixed it using an index file that stores the offsets of the records in the large file; then I just had to read one line at the right offset.
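A rough sketch of such an offset index, built on RandomAccessFile. Caveat: RandomAccessFile.readLine decodes bytes as Latin-1, so this only works as-is for single-byte encodings; the `LineIndex` class is hypothetical:

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.List;

public class LineIndex {
    private final List<Long> offsets = new ArrayList<>();
    private final RandomAccessFile raf;

    // Scan the file once, remembering the byte offset where each line starts.
    public LineIndex(String path) throws IOException {
        raf = new RandomAccessFile(path, "r");
        offsets.add(0L);
        while (raf.readLine() != null) {
            offsets.add(raf.getFilePointer());
        }
    }

    // Jump straight to line n instead of re-reading the whole file.
    public String line(int n) throws IOException {
        raf.seek(offsets.get(n));
        return raf.readLine();
    }
}
```

Only the index (one long per line, ~40 MB for 5 million lines) lives in memory; the 2 GB of row data stays on disk until a row is actually needed.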
Also, in your split method there are some strange things.
String remainder = new String(inputString);
You don't need to preserve inputString by making a copy: Strings are immutable, so nothing inside the split method can modify it.

Why is Java HashMap slowing down?

I'm trying to build a map from the content of a file, and my code is as below:
System.out.println("begin to build the sns map....");
String basePath = PropertyReader.getProp("oldbasepath");
String pathname = basePath + "\\user_sns.txt";
FileReader fr;
Map<Integer, List<Integer>> snsMap =
        new HashMap<Integer, List<Integer>>(2000000);
try {
    fr = new FileReader(pathname);
    BufferedReader br = new BufferedReader(fr);
    String line;
    int i = 1;
    while ((line = br.readLine()) != null) {
        System.out.println("line number: " + i);
        i++;
        String[] strs = line.split("\t");
        int key = Integer.parseInt(strs[0]);
        int value = Integer.parseInt(strs[1]);
        List<Integer> list = snsMap.get(key);
        // if the follower is not in the map
        if (snsMap.get(key) == null)
            list = new LinkedList<Integer>();
        list.add(value);
        snsMap.put(key, list);
        System.out.println("map size: " + snsMap.size());
    }
} catch (IOException e) {
    e.printStackTrace();
}
System.out.println("finish building the sns map....");
return snsMap;
The program is very fast at first, but gets much slower by the time the printed information reads:
map size: 1138338
line number: 30923602
map size: 1138338
line number: 30923603
....
I tried to find the reason using the two System.out.println() calls to judge the performance of BufferedReader and HashMap, instead of using a Java profiler.
Sometimes it takes a while to print the map size after printing the line number, and sometimes it takes a while to print the line number after printing the map size. My question is: which makes my program slow, the BufferedReader for a big file or the HashMap for a big map?
If you are testing this from inside Eclipse, you should be aware of the huge performance penalty of writing to stdout/stderr, due to Eclipse's capturing that output in the Console view. Printing inside a tight loop is always a performance issue, even outside of Eclipse.
But, if what you are complaining about is the slowdown experienced after processing 30 million lines, then I bet it's a memory issue. First it slows down due to intense GC'ing and then it breaks with OutOfMemoryError.
You will have to check your program with some profiling tools to understand why it is slow.
In general, file access is much slower than in-memory operations (unless you are constrained in memory and doing excessive GC), so the guess would be that reading the file is the slower part here.
Until you have profiled, you won't know what is slow and what isn't.
Most likely, the System.out will show up as being the bottleneck, and you'll then have to profile without them again. System.out is the worst thing you can do for finding performance bottlenecks, because in doing so you usually add an even worse bottleneck.
An obvious optimization for your code is to move the line
snsMap.put(key, list);
into the if statement. You only need to put this when you have created a new list. Otherwise, the put just replaces the current value with itself.
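The same fix can be expressed more compactly with Map.computeIfAbsent, which performs the lookup and the conditional insert in one step (the `add` helper method is just for illustration):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SnsMapDemo {
    // Only insert a new list when the key is missing; an existing list
    // is already reachable from the map and needs no re-put.
    static void add(Map<Integer, List<Integer>> map, int key, int value) {
        map.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
    }
}
```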
In Java, the cost associated with Integer objects (and in particular the use of Integer in the Collections API) is largely a memory (and thus garbage-collection!) issue. You can sometimes get significant gains by using primitive collections such as GNU Trove, depending on how well you can adjust your code to use them efficiently. Most of the gains of Trove are in memory usage. Definitely try rewriting your code to use TIntArrayList and TIntObjectMap from GNU Trove. I'd avoid linked lists, too, in particular for primitive types.
Roughly estimated, a HashMap<Integer, List<Integer>> needs at least 3*16 bytes per entry. The doubly linked list again needs at least 2*16 bytes per entry stored. 1m keys + 30m values ~ 1 GB. No overhead included yet. With GNU Trove's TIntObjectHashMap<TIntArrayList> that should be 4+4+16 bytes per key and 4 bytes per value, so 144 MB. The overhead is probably similar for both.
The reason that Trove uses less memory is because the types are specialized for primitive values such as int. They will store the int values directly, thus using 4 bytes to store each.
A Java collections HashMap consists of many objects. It roughly looks like this: there are Entry objects that point to a key and a value object each. These must be objects, because of the way generics are handled in Java. In your case, the key will be an Integer object, which uses 16 bytes (4 bytes mark, 4 bytes type, 4 bytes actual int value, 4 bytes padding) AFAIK. These are all 32 bit system estimates. So a single entry in the HashMap will probably need some 16 (entry) + 16 (Integer key) + 32 (yet empty LinkedList) bytes of memory that all need to be considered for garbage collection.
If you have lots of Integer objects, it just will take 4 times as much memory as if you could store everything using int primitives. This is the cost you pay for the clean OOP principles realized in Java.
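If pulling in Trove isn't an option, the core idea, storing int values directly in a growing array instead of boxed Integers in a linked structure, can be sketched in a few lines (`IntList` here is a made-up minimal stand-in, not the Trove API):

```java
import java.util.Arrays;

public class IntList {
    private int[] data = new int[8];
    private int size = 0;

    // Store primitives directly: 4 bytes per value, no per-element object header.
    public void add(int v) {
        if (size == data.length)
            data = Arrays.copyOf(data, data.length * 2);  // amortized doubling growth
        data[size++] = v;
    }

    public int get(int i) { return data[i]; }
    public int size()     { return size; }
}
```

Thirty million values in such a list cost roughly 120 MB of int storage, versus the roughly 1 GB estimated above for LinkedList<Integer>.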
The best way is to run your program with a profiler (for example, JProfiler) and see what parts are slow. Debug output, for example, can also slow your program down.
HashMap is not slow; in reality it is among the fastest of the map implementations. Hashtable is synchronized, which can make it slower; for thread-safe use, ConcurrentHashMap is usually the better choice anyway.
Important note: close the BufferedReader and the underlying file after you read the data... this might help.
e.g.: br.close() (closing the BufferedReader also closes the wrapped FileReader)
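Rather than calling close() by hand, a try-with-resources block guarantees the reader is closed even when an exception is thrown. A small sketch (`countLines` is just an illustrative method):

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class SafeRead {
    // The reader (and the underlying FileReader) is closed automatically
    // when the try block exits, even if an exception is thrown.
    static int countLines(String path) throws IOException {
        int n = 0;
        try (BufferedReader br = new BufferedReader(new FileReader(path))) {
            while (br.readLine() != null) n++;
        }
        return n;
    }
}
```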
Please check your system processes in the task manager; there may be too many processes running in the background.
Eclipse is sometimes quite resource-heavy, so try running the program from the console to check.