Copying a text file into a String - java

I run into the following error when I try to store a large file in a String.
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:2882)
at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:100)
at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:515)
at java.lang.StringBuffer.append(StringBuffer.java:306)
at rdr2str.ReaderToString.main(ReaderToString.java:52)
As is evident, I am running out of heap space. Basically, my program looks something like this:
FileReader fr = new FileReader(<filepath>);
sb = new StringBuffer();
char[] b = new char[BLKSIZ];
while ((n = fr.read(b)) > 0)
    sb.append(b, 0, n);
fileString = sb.toString();
Can someone suggest why I am running into this heap space error? Thanks.

You are running out of memory because, the way you've written your program, it requires storing the entire, arbitrarily large file in memory. You have two options:
You can increase the memory by passing command line switches to the JVM:
java -Xms<initial heap size> -Xmx<maximum heap size>
You can rewrite your logic so that it deals with the file data as it streams in, thereby keeping your program's memory footprint low.
I recommend the second option. It's more work but it's the right way to go.
EDIT: To determine your system's defaults for initial and max heap size, you can use this code snippet (which I stole from a JavaRanch thread):
public class HeapSize {
    public static void main(String[] args) {
        long kb = 1024;
        long heapSize = Runtime.getRuntime().totalMemory();
        long maxHeapSize = Runtime.getRuntime().maxMemory();
        System.out.println("Heap Size (KB): " + heapSize / kb);
        System.out.println("Max Heap Size (KB): " + maxHeapSize / kb);
    }
}

You allocate a small StringBuffer that gets longer and longer. Preallocate it according to the file size and you will also be a LOT faster.
Note that Java strings are UTF-16 internally while the file is likely 8-bit encoded, so the text takes roughly twice the file's size in memory.
Depending on the VM (32-bit? 64-bit?) and the limits set (http://www.devx.com/tips/Tip/14688) you may simply not have enough memory available. How large is the file actually?

In the OP, your program is aborting while the StringBuffer is being expanded. You should preallocate it to the size you need, or at least close to it. When the StringBuffer must expand, it needs RAM for both the original capacity and the new capacity. As TomTom said, your file is likely stored as 8-bit characters, so it will be converted to 16-bit Unicode in memory and will double in size.
The program has not even reached the next doubling yet: StringBuffer.toString() in Java 6 allocates a new String and the internal char[] is copied again (in some earlier versions of Java this was not the case). At the time of this copy you need double the heap space, so at that moment at least 4 times your actual file size (30 MB * 2 for byte -> Unicode, then 60 MB * 2 for the toString() call = 120 MB). Once this method is finished, GC will clean up the temporary objects.
If you cannot increase the heap space for your program you will have some difficulty. You cannot take the "easy" route and just return a String. You can try to do this incrementally so that you do not need to worry about the file size (one of the best solutions).
Look at your web service code in the client. It may provide a way to use a class other than String - perhaps a java.io.Reader, java.lang.CharSequence, or a special interface like the SAX-related org.xml.sax.InputSource. Each of these can be used to build an implementation class that reads from your file in chunks as the caller needs it, instead of loading the whole file at once.
For instance, if your web service handling routines can take a CharSequence then (if they are written well) you can create a special handler that returns just one character at a time from the file, but buffers the input. See this similar question: How to deal with big strings and limited memory.
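For what it's worth, a minimal sketch of handing the consumer a Reader-backed InputSource instead of a fully built String; the client call is hypothetical and depends on what your web service API actually accepts:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.Reader;

import org.xml.sax.InputSource;

public class StreamingSource {
    public static void main(String[] args) throws Exception {
        // Characters are pulled from the file in chunks as the consumer needs them,
        // so the whole file is never held in memory at once.
        try (Reader reader = new BufferedReader(new FileReader(args[0]), 64 * 1024)) {
            InputSource source = new InputSource(reader);
            // client.send(source); // hypothetical call into your web service client
        }
    }
}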

Kris has the answer to your problem.
You could also look at Apache Commons IO's FileUtils.readFileToString, which may be a bit more efficient.

Although this might not solve your problem, some small things you can do to make your code a bit better:
create your StringBuffer with an initial capacity close to the size of the text you are reading
close your FileReader at the end: fr.close();
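A minimal sketch combining both suggestions (StringBuilder is used here instead of StringBuffer, and file.length() is a byte count, so it is only an approximation of the final character count):

import java.io.File;
import java.io.FileReader;

public class FileToString {
    public static void main(String[] args) throws Exception {
        File file = new File(args[0]);
        // Preallocate roughly the final size so the builder rarely has to grow.
        int initialCapacity = (int) Math.min(file.length(), Integer.MAX_VALUE - 16);
        StringBuilder sb = new StringBuilder(initialCapacity);
        char[] buf = new char[8192];
        int n;
        // try-with-resources closes the reader even if an exception is thrown.
        try (FileReader fr = new FileReader(file)) {
            while ((n = fr.read(buf)) > 0) {
                sb.append(buf, 0, n);
            }
        }
        String fileString = sb.toString();
        System.out.println(fileString.length());
    }
}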

By default, Java starts with a very small maximum heap (64M on Windows at least). Is it possible you are trying to read a file that is too large?
If so you can increase the heap with the JVM parameter -Xmx256M (to set maximum heap to 256 MB)
I tried running a slightly modified version of your code:
public static void main(String[] args) throws Exception {
    FileReader fr = new FileReader("<filepath>");
    StringBuffer sb = new StringBuffer();
    char[] b = new char[1000];
    int n = 0;
    while ((n = fr.read(b)) > 0)
        sb.append(b, 0, n);
    fr.close();
    String fileString = sb.toString();
    System.out.println(fileString);
}
on a small file (2 KB) and it worked as expected. You will need to set the JVM parameter.

Trying to read an arbitrarily large file into main memory in an application is bad design. Period. No amount of JVM setting adjustments is going to fix the core issue here. I recommend that you take a break and do some googling and reading about how to process streams in Java - here's a good tutorial and here's another good tutorial to get you started.
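A minimal sketch of that streaming approach, assuming the per-line processing can be done independently (the process method is a placeholder):

import java.io.BufferedReader;
import java.io.FileReader;

public class StreamProcess {
    public static void main(String[] args) throws Exception {
        try (BufferedReader br = new BufferedReader(new FileReader(args[0]))) {
            String line;
            while ((line = br.readLine()) != null) {
                // handle one line at a time instead of accumulating the whole file
                process(line);
            }
        }
    }

    private static void process(String line) {
        // placeholder for whatever per-line work the application needs
    }
}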

Related

maximum limit on Java array

I am trying to create 2D array in Java as follows:
int[][] adjecancy = new int[96295][96295];
but it is failing with the following error:
JVMDUMP039I Processing dump event "systhrow", detail "java/lang/OutOfMemoryError" at 2017/04/07 11:58:55 - please wait.
JVMDUMP032I JVM requested System dump using 'C:\eclipse\workspaces\TryJavaProj\core.20170407.115855.7840.0001.dmp' in response to an event
JVMDUMP010I System dump written to C:\eclipse\workspaces\TryJavaProj\core.20170407.115855.7840.0001.dmp
JVMDUMP032I JVM requested Heap dump using 'C:\eclipse\workspaces\TryJavaProj\heapdump.20170407.115855.7840.0002.phd' in response to an event
JVMDUMP010I Heap dump written to C:\eclipse\workspaces\TryJavaProj\heapdump.20170407.115855.7840.0002.phd
A way to solve this is by increasing the JVM memory, but I am trying to submit the code for an online coding challenge, where it is also failing, and I will not be able to change the settings there.
Is there any standard limit or guidance for creating large arrays which one should not exceed?
int[][] adjecancy = new int[96295][96295];
When you do that you are trying to allocate 96295 * 96295 * 4 bytes, which is nearly 37,091 MB, or about 37 GB. It is practically impossible to get that much memory from a PC for Java alone.
I don't think you need that much data in memory at initialization of your program. Probably you should look at ArrayList, which gives you dynamic sizing, and keep freeing up entries at runtime.
There is no hard limit or restriction on creating an array. As long as you have memory, you can use it. But keep in mind that you should not hold a block of memory so large that it makes the JVM's life hectic.
The array must obviously fit into memory. If it does not, the typical solutions are:
Do you really need int (max value 2,147,483,647)? Maybe byte (max value 127) or short is good enough? byte is 4 times smaller than int.
Do you really have many identical values in the array (like zeros)? Try to use sparse arrays.
for instance:
Map<Integer, Map<Integer, Integer>> map = new HashMap<>();
map.put(27, new HashMap<Integer, Integer>()); // row 27 exists
map.get(27).put(54, 1); // row 27, column 54 has value 1.
They need more memory per value stored, but have basically no limits on the array space (you can use Long rather than Integer as index to make them really huge).
Maybe you just do not know how long the array should be? Try ArrayList, it self-resizes. Use ArrayList of ArrayLists for 2D array.
If nothing else helps, use RandomAccessFile to store your overgrown data in the filesystem; you just need to compute the required offset in the file. 100 GB or so is not a problem these days on a good workstation. The filesystem is obviously much slower than RAM, but with a good SSD drive it may be bearable.
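A rough sketch of that last option, using a hypothetical FileBackedIntMatrix wrapper that stores the matrix row by row and computes the byte offset of each element:

import java.io.IOException;
import java.io.RandomAccessFile;

public class FileBackedIntMatrix {
    private final RandomAccessFile file;
    private final long columns;

    public FileBackedIntMatrix(String path, long rows, long columns) throws IOException {
        this.file = new RandomAccessFile(path, "rw");
        this.columns = columns;
        file.setLength(rows * columns * 4L); // 4 bytes per int
    }

    public void set(long row, long col, int value) throws IOException {
        file.seek((row * columns + col) * 4L); // offset of element (row, col)
        file.writeInt(value);
    }

    public int get(long row, long col) throws IOException {
        file.seek((row * columns + col) * 4L);
        return file.readInt();
    }
}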
The maximum heap size that can be allocated is typically recommended to be 1/4 of the machine's RAM.
One int in Java takes 4 bytes, and your array allocation needs approximately 37.09 GB of memory.
In that case, even if I assume you are allocating the full heap to just this array, your machine would need around 148 GB of RAM. That is huge.
Have a look at the reference below.
Ref: http://docs.oracle.com/javase/8/docs/technotes/guides/vm/gc-ergonomics.html
Hope this helps.
It depends on the maximum memory available to your JVM and the content type of the array. An int takes 4 bytes of memory, so if 1 MB of memory is available, it can hold a maximum of 1024 * 256 integers (1 MB = 1024 * 1024 bytes). Keeping that in mind, you can size your 2D array accordingly.
The array size you can create depends on the JVM heap size.
96295 * 96295 * 4 (bytes per int) = 37,090,908,100 bytes, roughly 34.5 GiB. Most JVMs in competitive coding judges don't have that much memory. Hence the error.
To get a good idea of what array size you can use for a given heap size, run this code snippet with different -Xmx settings:
Scanner scanner = new Scanner(System.in);
while (true) {
    System.out.println("Enter 2-D array size: ");
    int size = scanner.nextInt();
    int[][] numbers = new int[size][size];
    numbers = null; // drop the reference so the next allocation can succeed
}
e.g. with -Xmx512m -> a 2-D array of roughly 10k x 10k elements.
Generally, most online judges have ~1.5-2 GB of heap when evaluating submissions.

Java OutOfMemoryError in reading a large text file

I'm new to Java and working on reading very large files; I need some help understanding the problem and solving it. We have some legacy code that has to be optimized to make it run properly. The file size can vary from 10 MB to 10 GB. The trouble only starts when the file size goes beyond 800 MB.
InputStream inFileReader = channelSFtp.get(path); // file reading from ssh.
byte[] localbuffer = new byte[2048];
ByteArrayOutputStream bArrStream = new ByteArrayOutputStream();

int i = 0;
while (-1 != (i = inFileReader.read(localbuffer))) {
    bArrStream.write(localbuffer, 0, i);
}

byte[] data = bArrStream.toByteArray();

inFileReader.close();
bArrStream.close();
We are getting the error
java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:2271)
at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
Any help would be appreciated.
Try to use java.nio.MappedByteBuffer.
http://docs.oracle.com/javase/7/docs/api/java/nio/MappedByteBuffer.html
You can map a file's content into memory without copying it manually. High-level operating systems offer memory mapping, and Java has an API to use the feature.
If my understanding is correct, memory mapping does not load a file's entire content into memory (it is loaded and unloaded partially, as necessary), so I guess a 10 GB file won't eat up your memory.
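A minimal sketch of the memory-mapping approach, assuming read-only access and mapping the file in windows because a single MappedByteBuffer cannot exceed about 2 GB:

import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MappedRead {
    public static void main(String[] args) throws Exception {
        try (RandomAccessFile raf = new RandomAccessFile(args[0], "r");
             FileChannel channel = raf.getChannel()) {
            long remaining = channel.size();
            long position = 0;
            while (remaining > 0) {
                // map the file in 512 MB windows to stay well under the 2 GB limit
                long chunk = Math.min(remaining, 512L * 1024 * 1024);
                MappedByteBuffer buf = channel.map(FileChannel.MapMode.READ_ONLY, position, chunk);
                while (buf.hasRemaining()) {
                    byte b = buf.get();
                    // process b (or bulk-get into a small array) here
                }
                position += chunk;
                remaining -= chunk;
            }
        }
    }
}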
Even though you can increase the JVM memory limit, it is needless, and allocating a huge amount of memory like 10 GB just to process a file sounds like overkill and is resource intensive.
Currently you are using a ByteArrayOutputStream, which keeps an internal buffer to hold the data. This line in your code keeps appending the last 2 KB chunk read from the file to the end of this buffer:
bArrStream.write(localbuffer, 0, i);
bArrStream keeps growing and eventually you run out of memory.
Instead you should reorganize your algorithm and process the file in a streaming way:
InputStream inFileReader = channelSFtp.get(path); // file reading from ssh.
byte[] localbuffer = new byte[2048];

int i = 0;
while (-1 != (i = inFileReader.read(localbuffer))) {
    // deal with the current 2 KB chunk read from the file here
}
inFileReader.close();
The Java virtual machine (JVM) runs with a fixed upper memory limit, which you can modify thus:
java -Xmx1024m ....
e.g. the above option (-Xmx...) sets the limit to 1024 megabytes. You can amend it as necessary (within the limits of your machine, OS, etc.). Note that this is different from traditional applications, which allocate more and more memory from the OS on demand.
However a better solution is to rework your application such that you don't need to load the whole file into memory at one go. That way you don't have to tune your JVM, and you don't impose a huge memory footprint.
You can't read a 10 GB text file into memory. You have to read X MB first, do something with it, and then read the next X MB.
The problem is inherent in what you're doing. Reading entire files into memory is always and everywhere a bad idea. You're really not going to be able to read a 10GB file into memory with current technology unless you have some pretty startling hardware. Find a way to process them line by line, record by record, chunk by chunk, ...
Is it mandatory to get the entire byte array out of the output stream?
byte[] data = bArrStream.toByteArray();
The best approach is to read line by line and write line by line. You can use a BufferedReader or a Scanner to read large files, as below.
import java.io.*;
import java.util.*;

public class FileReadExample {
    public static void main(String args[]) throws FileNotFoundException {
        File fileObj = new File(args[0]);

        long t1 = System.currentTimeMillis();
        try (
            // BufferedReader object for reading the file
            BufferedReader br = new BufferedReader(new FileReader(fileObj));) {
            // Reading each line of the file using the BufferedReader class
            String str;
            while ((str = br.readLine()) != null) {
                System.out.println(str);
            }
        } catch (Exception err) {
            err.printStackTrace();
        }
        long t2 = System.currentTimeMillis();
        System.out.println("Time taken for BufferedReader:" + (t2 - t1));

        t1 = System.currentTimeMillis();
        try (
            // Scanner object for reading the file
            Scanner scnr = new Scanner(fileObj);) {
            // Reading each line of the file using the Scanner class
            while (scnr.hasNextLine()) {
                String strLine = scnr.nextLine();
                // print data on console
                System.out.println(strLine);
            }
        }
        t2 = System.currentTimeMillis();
        System.out.println("Time taken for scanner:" + (t2 - t1));
    }
}
You can replace System.out with your ByteArrayOutputStream in above example.
Please have a look at below article for more details: Read Large File
Have a look at related SE question:
Scanner vs. BufferedReader
ByteArrayOutputStream writes to an in-memory buffer. If this is really how you want it to work, then you have to size the JVM heap after the maximum possible size of the input. Also, if possible, you can check the input size before even starting processing, to save time and resources.
The alternative approach is a streaming solution, where the amount of memory used at runtime is known (maybe configurable, but still known before the program starts). Whether that is feasible or not depends entirely on your application's domain (because you can't use an in-memory buffer anymore), and maybe on the architecture of the rest of your code if you can't or don't want to change it.
Try using a larger read buffer, maybe 10 MB, and then check.
Read the file iteratively, line by line. This will significantly reduce memory consumption. Alternatively, you can use
FileUtils.lineIterator(theFile, "UTF-8");
provided by Apache Commons IO.
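A minimal sketch of that lineIterator pattern, assuming Commons IO is on the classpath (the plain-JDK Scanner variant follows below):

import java.io.File;

import org.apache.commons.io.FileUtils;
import org.apache.commons.io.LineIterator;

public class LineIteratorExample {
    public static void main(String[] args) throws Exception {
        LineIterator it = FileUtils.lineIterator(new File(args[0]), "UTF-8");
        try {
            while (it.hasNext()) {
                String line = it.nextLine();
                // process one line at a time; only the current line is held in memory
            }
        } finally {
            LineIterator.closeQuietly(it);
        }
    }
}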
FileInputStream inputStream = null;
Scanner sc = null;
try {
    inputStream = new FileInputStream(path);
    sc = new Scanner(inputStream, "UTF-8");
    while (sc.hasNextLine()) {
        String line = sc.nextLine();
        // System.out.println(line);
    }
    // note that Scanner suppresses exceptions
    if (sc.ioException() != null) {
        throw sc.ioException();
    }
} finally {
    if (inputStream != null) {
        inputStream.close();
    }
    if (sc != null) {
        sc.close();
    }
}
Run Java with the command-line option -Xmx, which sets the maximum size of the heap.
See here for details.
Assuming that you are reading a large text file and the data is laid out line by line, use a line-by-line reading approach. As far as I know, you can read up to 6 GB, maybe more.
...
// Open the file
FileInputStream fstream = new FileInputStream("textfile.txt");
BufferedReader br = new BufferedReader(new InputStreamReader(fstream));

String strLine;

// Read file line by line
while ((strLine = br.readLine()) != null) {
    // Print the content on the console
    System.out.println(strLine);
}

// Close the input stream
br.close();
Reference for the code fragment
Short answer,
Without doing anything, you can push the current limit by a factor of 1.5. It means that if you are able to process 800 MB, you can process 1200 MB. It also means that if, by some trick with java -Xmx..., you can move to a point where your current code can process 7 GB, your problem is solved, because the 1.5 factor will take you to 10.5 GB, assuming you have that space available on your system and that the JVM can get it.
Long answer:
The error is pretty self-descriptive. You hit the practical memory limit of your configuration. There is a lot of speculation about the limit you can have with the JVM; I do not know enough about that, since I cannot find any official information. However, you will somehow be limited by constraints like the available swap, the kernel address space usage, memory fragmentation, etc.
What is happening now is that ByteArrayOutputStream objects are created with a default buffer of size 32 if you do not supply any size (which is your case). Whenever you call the write method on the object, internal machinery is started. The OpenJDK implementation release 7u40-b43, which seems to match the output of your error perfectly, uses an internal method ensureCapacity to check that the buffer has enough room to put the bytes you want to write. If there is not enough room, another internal method grow is called to grow the size of the buffer. The method grow determines the appropriate size and calls the method copyOf from the class Arrays to do the job.
The appropriate size of the buffer is the maximum of twice the current size and the size required to hold all the content (the present content plus the new content to be written).
The method copyOf from the class Arrays (follow the link) allocates the space for the new buffer, copies the content of the old buffer to the new one and returns it to grow.
Your problem occurs at the allocation of the space for the new buffer. After some write, you get to a point where the available memory is exhausted: java.lang.OutOfMemoryError: Java heap space.
If we look into the details, you are reading in chunks of 2048. So:
your first write grows the size of the buffer from 32 to 2048
your second call will double it to 2 * 2048
your third call will take it to 2^2 * 2048; you have time to write two more times before the next allocation
then 2^3 * 2048; you will have time for 4 more writes before allocating again
at some point, your buffer will be of size 2^18 * 2048, which is 2^19 * 1024 or 2^9 * 2^20 (512 MB)
then 2^19 * 2048, which is 1024 MB or 1 GB
Something that is unclear in your description is that you can somehow read up to 800 MB but cannot go beyond. You have to explain that to me.
I expect your limit to be exactly a power of 2 (or close, if we use power-of-10 units somewhere). In that regard, I expect you to start having trouble immediately above one of these: 256 MB, 512 MB, 1 GB, 2 GB, etc.
When you hit that limit, it does not mean that you are out of memory; it simply means that it is not possible to allocate another buffer of twice the size of the buffer you already have. This observation opens room for improvement in your work: find the maximum size of buffer that you can allocate and reserve it upfront by calling the appropriate constructor:
ByteArrayOutputStream bArrStream = new ByteArrayOutputStream(myMaxSize);
This has the advantage of reducing the background memory allocation that happens under the hood to keep you happy. By doing this, you will be able to go to 1.5 times the limit you have right now, simply because the last time the buffer was increased it went from half the current size to the current size, and at that moment you had both the current buffer and the old one together in memory. But you will not be able to go beyond 3 times the limit you have now; the explanation is exactly the same.
That being said, I do not have any magic suggestion to solve the problem apart from processing your data in chunks of a given size, one chunk at a time. Another good approach is to follow the suggestion of Takahiko Kawasaki and use MappedByteBuffer. Keep in mind that in any case you will need at least 10 GB of physical memory or swap to be able to load a 10 GB file.
After thinking about it, I decided to put a second answer. I considered the advantages and disadvantages of putting this second answer, and the advantages are worth going for it. So here it is.
Most of the suggested considerations are forgetting a given fact: there is a built-in limit on the size of arrays (including the one inside ByteArrayOutputStream) that you can have in Java, and that limit is dictated by the biggest int value, which is 2^31 - 1 (a little less than 2 giga). This means that you can only read a maximum of 2 GB (minus 1 byte) and put it in a single ByteArrayOutputStream. The limit might actually be smaller for the array size if the VM wants more control.
My suggestion is to use an ArrayList of byte[] instead of a single byte[] holding the full content of the file, and also to remove the unnecessary step of putting the data in a ByteArrayOutputStream before putting it in the final data array. Here is an example based on your original code:
InputStream inFileReader = channelSFtp.get(path); // file reading from ssh.

// good habits are good, define a buffer size
final int BUF_SIZE = (int) (Math.pow(2, 30)); // 1 GB, let's not go close to the limit
List<byte[]> data = new ArrayList<byte[]>();

byte[] localbuffer = new byte[BUF_SIZE];
int i = 0;
while (-1 != (i = inFileReader.read(localbuffer))) {
    if (i < BUF_SIZE) {
        data.add(Arrays.copyOf(localbuffer, i));
        // No need to reallocate the reading buffer, we copied the data
    } else {
        data.add(localbuffer);
        // reallocate the reading buffer
        localbuffer = new byte[BUF_SIZE];
    }
}

inFileReader.close();

// Process your data, keeping in mind that you have a list of buffers,
// so you need to loop over the list.
Simply running your program should work fine on a 64-bit system with enough physical memory or swap. Now, if you want to speed it up and help the VM size the heap correctly at the beginning, run with the options -Xms and -Xmx. For example, if you want a 12 GB heap to be able to handle a 10 GB file, use java -Xms12288m -Xmx12288m YourApp.

Does OutputStream.write(buf, offset, size) have memory leak on Linux?

I wrote a piece of Java code to create 500K small files (average 40 KB each) on CentOS. The original code is like this:
package MyTest;

import java.io.*;

public class SimpleWriter {
    public static void main(String[] args) {
        String dir = args[0];
        int fileCount = Integer.parseInt(args[1]);
        String content = "##$% SDBSDGSDF ASGSDFFSAGDHFSDSAWE^#$^HNFSGQW%##&$%^J#%##^$#UHRGSDSDNDFE$T##$UERDFASGWQR!#%!#^$##YEGEQW%!#%!!GSDHWET!^";
        StringBuilder sb = new StringBuilder();
        int count = 40 * 1024 / content.length();
        int remainder = (40 * 1024) % content.length();
        for (int i = 0; i < count; i++) {
            sb.append(content);
        }
        if (remainder > 0) {
            sb.append(content.substring(0, remainder));
        }
        byte[] buf = sb.toString().getBytes();

        for (int j = 0; j < fileCount; j++) {
            String path = String.format("%s%sTestFile_%d.txt", dir, File.separator, j);
            try {
                BufferedOutputStream fs = new BufferedOutputStream(new FileOutputStream(path));
                fs.write(buf);
                fs.close();
            } catch (FileNotFoundException fe) {
                System.out.printf("Hit file not found exception %s", fe.getMessage());
            } catch (IOException ie) {
                System.out.printf("Hit IO exception %s", ie.getMessage());
            }
        }
    }
}
You can run this by issue following command:
java -jar SimpleWriter.jar my_test_dir 500000
I thought this was simple code, but then I realized that it was using up to 14 GB of memory. I know that because when I used free -m to check the memory, the free memory kept dropping until my 15 GB memory VM had only 70 MB of free memory left. I compiled this in Eclipse, against JDK 1.6 and then JDK 1.7; the result is the same. The funny thing is that if I comment out fs.write() and just open and close the stream, the memory stabilizes at a certain point. Once I put fs.write() back, the memory allocation just goes wild. 500K 40 KB files is about 20 GB. It seems Java's stream writer never deallocates its buffer during the operation.
I once thought the Java GC does not have time to clean up, but that makes no sense since I closed the file stream for every file. I even ported my code to C#, and running under Windows, the same code produced 500K 40 KB files with memory stable at a certain point, not taking 14 GB as under CentOS. At least C#'s behavior is what I expected, but I could not believe Java performs this way. I asked my colleagues who are experienced in Java. They could not see anything wrong in the code, but could not explain why this happened, and they admitted nobody had tried to create 500K files in a loop without stopping.
I also searched online and everybody says that the only thing you need to pay attention to is closing the stream, which I did.
Can anyone help me to figure out what's wrong?
Can anybody also try this and tell me what you see?
BTW, some people in this community tried the code on Windows and it seemed to work fine. I didn't try it on Windows; I only tried it on Linux, as I thought that is where people use Java. So it seems this issue happens on Linux.
I also did the following to limit the JVM heap, but it had no effect:
java -Xmx2048m -jar SimpleWriter.jar my_test_dir 500000
I tried to test your program on Win XP, JDK 1.7.25, and immediately got an OutOfMemoryError.
While debugging, with a count (args[1]) of only 3000, the count variable from this code:
int count = 40 * 1024 * 1024 / content.length();
int remainder = (40 * 1024 * 1024) % content.length();
for (int i = 0; i < count; i++) {
    sb.append(content);
}
count is 355449. So the String you are trying to create will be 355449 * content.length() characters long or, as you calculated, about 40 MB. I ran out of memory when i was 266587 and sb was 31457266 chars long, at which point each file I get is 30 MB.
The problem does not seem to be with memory or GC, but with the way you create the string.
Did you see files being created, or was memory eaten up before any file was created?
I think your main problem is the line:
int count = 40 * 1024 * 1024 / content.length();
should be:
int count = 40 * 1024 / content.length();
to create 40 KB files, not 40 MB files.
[Edit2: The original answer is left in italics at the end of this post]
After your clarifications in the comments, I ran your code on a Windows machine (Java 1.6) and here are my findings (numbers are from VisualVM, OS memory as seen from Task Manager):
Example with 40K size, writing to 500K files (no parameters to JVM):
Used Heap: ~4M, Total Heap: 16M, OS memory: ~16M
Example with 40M size, writing to 500 files (parameters to JVM -Xms128m -Xmx512m. Without parameters I get an OutOfMemory error when creating StringBuilder):
Used Heap: ~265M, Heap size: ~365M, OS memory: ~365M
Especially from the second example you can see that my original explanation still stands. Yes, one would expect that most of the memory would be freed, since the byte[] arrays of the BufferedOutputStream reside in the young generation space (short-lived objects), but this a) does not happen immediately and b) when the GC decides to kick in (it actually does in my case), yes, it will try to clear memory, but it can clear as much memory as it sees fit, not necessarily all of it. The GC does not provide any guarantees that you can count upon.
So, generally speaking, you should give the JVM as much memory as you feel comfortable with. If you need to keep the memory low for special functionality, you should try a strategy like the code example I gave below in my original answer, i.e. just don't create all those byte[] objects.
Now, in your case with CentOS, it does seem that the JVM behaves strangely. Perhaps we could talk about a buggy or bad implementation. To classify it as a leak/bug, though, you should try to use -Xmx to restrict the heap. Also, please try what Peter Lawrey suggested and do not create the BufferedOutputStream at all (in the small-file case), since you just write all the bytes at once.
If it still exceeds the memory limit then you have encountered a leak and should probably file a bug. (You could still complain though and they may optimize it in the future).
[Edit1: The answer below assumed that the OP's code performed as many read operations as write operations, so the memory usage was justifiable. The OP clarified this is not the case, so his question is not answered.
"...my 15G memory VM..."
If you give the JVM that much memory, why should it try to run GC? As far as the JVM is concerned, it is allowed to get as much memory from the system and to run GC only when it thinks it is appropriate to do so.
Each instantiation of BufferedOutputStream will allocate a buffer of 8 KB by default. The JVM will try to reclaim that memory only when it needs to. This is the expected behaviour.
Do not confuse the memory that you see as free from the system's point of view with the memory that is free from the JVM's point of view. As far as the system is concerned, the memory is allocated and will be released when the JVM shuts down. As far as the JVM is concerned, all the byte[] arrays allocated from BufferedOutputStream are not in use any more; it is "free" memory and will be reclaimed if needed.
If for some reason you don't want this behaviour, you could try the following: extend the BufferedOutputStream class (e.g. create a ReusableBufferedOutputStream class) and add a new method, e.g. reUseWithStream(OutputStream os). This method would then clear the internal byte[], flush and close the previous stream, reset any variables used and set the new stream. Your code would then become as below:
// initialize once
ReusableBufferedOutputStream fs = new ReusableBufferedOutputStream();

for (int i = 0; i < fileCount; i++) {
    String path = String.format("%s%sTestFile_%d.txt", dir, File.separator, i);
    // set the new stream to be buffered and written
    fs.reUseWithStream(new FileOutputStream(path));
    fs.write(this._buf, 0, this._buf.length); // this._buf was allocated once, 40K long, contains the text
}

fs.close(); // Close the stream after we are done
Using the above approach you will avoid creating many byte[] arrays. However, I don't see any problem with the expected behaviour, nor do you mention any problem other than "I see it takes too much memory". You have configured it to use it, after all.]
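For reference, a rough sketch of what such a ReusableBufferedOutputStream could look like; this is a hypothetical class, not part of the JDK, and it relies on the protected out, buf and count fields inherited from FilterOutputStream and BufferedOutputStream:

import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.OutputStream;

public class ReusableBufferedOutputStream extends BufferedOutputStream {

    public ReusableBufferedOutputStream() {
        // Start with a do-nothing stream until reUseWithStream() supplies a real one.
        super(new OutputStream() {
            @Override
            public void write(int b) {
                // discard
            }
        });
    }

    /** Flushes and closes the previous stream, then starts buffering into the new one. */
    public void reUseWithStream(OutputStream os) throws IOException {
        flush();     // push any buffered bytes to the old stream
        out.close(); // 'out' is the protected field from FilterOutputStream
        count = 0;   // 'count' is the protected buffer fill index from BufferedOutputStream
        out = os;    // the internal byte[] 'buf' is kept and reused
    }
}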

Best way to load a large file into arraylist in java

I have a file whose size is about 300 MB. I want to read the contents line by line and then add them to an ArrayList. So I have made an ArrayList object a1, and I am reading the file using a BufferedReader; after that, when I add the lines from the file to the ArrayList, it gives the error Exception in thread "main" java.lang.OutOfMemoryError: Java heap space.
Please tell me what should be the solution for this.
public static void main(String[] args) {
    // TODO Auto-generated method stub
    try {
        FileReader file = new FileReader(
                "/home/dmdd/Desktop/AsiaData/RawData/AllupperairVcomponent.txt");
        ArrayList<String> a1 = new ArrayList<String>();
        BufferedReader br = new BufferedReader(file);
        String line = "";
        while ((line = br.readLine()) != null) {
            a1.add(line);
        }
    } catch (Exception e) {
        // TODO: handle exception
        e.printStackTrace();
    }
}
Naively, increase the size of the heap via the -Xmx command line argument (see this excellent answer for some guidance).
This will only work up to a point, though. Instead, consider structuring your data so that the memory requirements are minimized. Do you need the whole thing in memory at once? Perhaps you only need to test whether an item is in that set; consider using a hash set or a Bloom filter (etc.).
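For example, if membership testing is really all the application needs, a minimal sketch using a HashSet instead of an ArrayList:

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.HashSet;
import java.util.Set;

public class MembershipIndex {
    public static void main(String[] args) throws Exception {
        Set<String> seen = new HashSet<String>();
        try (BufferedReader br = new BufferedReader(new FileReader(args[0]))) {
            String line;
            while ((line = br.readLine()) != null) {
                seen.add(line); // keep only what the membership test needs
            }
        }
        // O(1) lookup instead of scanning a list
        System.out.println(seen.contains("some value"));
    }
}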
Just increase the heap size of Java:
java -Xmx250m
If you are running your project from an IDE, set -Xmx250m in the VM arguments.
250m is 250 MB.
If you have to have it in memory, you could try increasing the heap size by passing the -Xmx option to the java executable.
It may also be worth considering whether you really need all that data in memory at the same time. It could be that you can either process it sequentially, or keep most or all of it on disk.
Pass -Xmx1024m to increase your heap space to 1024 MB:
java -Xms512m -Xmx1024m HelloWorld
You can increase it up to 4 GB on a 32-bit system, and on a 64-bit system you can go much higher.
Use java.nio.file.Files.readAllLines; it returns a List<String>. And if you're getting an OOME, increase the heap size, e.g. java -Xmx1024m.
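A minimal sketch of that call; note it still loads the whole file into memory, so the heap must be big enough for it:

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

public class ReadAllLinesExample {
    public static void main(String[] args) throws Exception {
        // Reads every line into memory at once; requires Java 7+.
        List<String> lines = Files.readAllLines(Paths.get(args[0]), StandardCharsets.UTF_8);
        System.out.println(lines.size());
    }
}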
I partly agree with #Murali: this will fix the problem you are facing. But it is advisable to use caching when handling large files. What if the file size becomes 500 MB in a rare case? Make use of a caching API like Memcached; this will eliminate memory outages in the JVM.
If you can: process the file in batches of 10000 lines or so.
read 10k lines
process
repeat until done
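A minimal sketch of that batching loop (the process method is a placeholder for whatever work is done per batch):

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.List;

public class BatchProcessor {
    private static final int BATCH_SIZE = 10000;

    public static void main(String[] args) throws Exception {
        try (BufferedReader br = new BufferedReader(new FileReader(args[0]))) {
            List<String> batch = new ArrayList<String>(BATCH_SIZE);
            String line;
            while ((line = br.readLine()) != null) {
                batch.add(line);
                if (batch.size() == BATCH_SIZE) {
                    process(batch);
                    batch.clear(); // release the lines before reading the next batch
                }
            }
            if (!batch.isEmpty()) {
                process(batch); // leftover lines from the final partial batch
            }
        }
    }

    private static void process(List<String> batch) {
        // placeholder: validate / transform / write out the current batch
    }
}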

Is this leaking memory or am I just reaching the limit of objects I can keep in memory?

I have two large CSV files which contain data that is required for users of a web application to validate some info. I defined an ArrayList< String[] > and intended to keep the contents of both files in memory so I wouldn't have to read them each time a user logged in and used the application.
I'm getting a java.lang.OutOfMemoryError: Java heap space, though, when initializing the application and trying to read the second file. (It finishes reading the first file just fine but hangs when reading the second file and after a while I get that exception)
The code for reading the files is pretty straightforward:
ArrayList<String[]> tokenizedLines = new ArrayList<String[]>();

public void parseTokensFile() throws Exception {
    BufferedReader bRead = null;
    FileReader fRead = null;

    try {
        fRead = new FileReader(this.tokensFile);
        bRead = new BufferedReader(fRead);
        String line;
        while ((line = bRead.readLine()) != null) {
            tokenizedLines.add(StringUtils.split(line, fieldSeparator));
        }
    } catch (Exception e) {
        throw new Exception("Error parsing file.");
    } finally {
        bRead.close();
        fRead.close();
    }
}
I read that Java's split function can use up a lot of memory when reading large amounts of data, since the substring function keeps a reference to the original string; a substring of some String will therefore use up the same amount of memory as the original, even though we only want a few chars. So I wrote a simple split function to try to avoid this:
public String[] split(String inputString, String separator) {
    ArrayList<String> storage = new ArrayList<String>();
    String remainder = new String(inputString);
    int separatorLength = separator.length();

    while (remainder.length() > 0) {
        int nextOccurance = remainder.indexOf(separator);
        if (nextOccurance != -1) {
            storage.add(new String(remainder.substring(0, nextOccurance)));
            remainder = new String(remainder.substring(nextOccurance + separatorLength));
        } else {
            break;
        }
    }

    storage.add(remainder);

    String[] tokenizedFields = storage.toArray(new String[storage.size()]);
    storage = null;

    return tokenizedFields;
}
This gives me the same error though, so I'm wondering if it's not a memory leak but simply that I can't have structures with so many objects in memory. One file is about 600'000 lines long, with 5 fields per line, and the other is around 900'000 lines long with about the same amount of fields per line.
The full stacktrace is:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at xxx.xxx.xxx.StringUtils.split(StringUtils.java:16)
at xxx.xxx.xxx.GFTokensFile.parseTokensFile(GFTokensFile.java:36)
So, after the long post (sorry :P), is this a restriction of the amount of memory assigned to my JVM or am I missing something obvious and wasting resources somewhere?
Your JVM won't get more than 2GB on a 32-bit operating system with 4GB of RAM. That's one upper limit.
The second is the max heap size you specify when you start the JVM. Look at that -Xmx parameter.
The third is the fact of life that you cannot fit X units of anything into a Y sized container where X > Y. You know the size of your files. Try parsing each one individually and seeing what kind of heap they're consuming.
I'd recommend that you download Visual VM, install all the available plugins, and have it monitor your application while it's running. You'll be able to see the entire heap, perm gen space, GC collection, what objects are taking up the most memory, etc.
Getting data is invaluable for all problems, but especially ones like this. Without it, you're just guessing.
I cannot see a storage leak in the original version of the program.
The scenarios where split and similar methods can leak significant storage are rather limited:
You have to NOT be retaining a reference to the original string that you split.
You need to be retaining references to a subset of the strings produced by the string splitting.
What happens when String.substring() is called is that it creates a new String object that shares the original String's backing array. If the original String reference is then garbage collected, then the substring String is now holding onto an array of characters that includes characters that are not "in" the substring. This can be a storage leak, depending on how long the substring is kept.
In your example, you are keeping strings that contain all characters apart from the field separator character. There is a good chance that this is actually saving space ... compared to the space used if each substring was an independent String. Certainly, it is no surprise that your version of split doesn't solve the problem.
I think you need to either increase the heap size, or change your application so that it doesn't need to keep all of the data in memory at the same time.
Try improving your code or leave the data processing to a database.
The memory usage grows with your file sizes, since the code makes redundant copies of the processed data: at any moment there is the data still to be processed, the processed data, and some partial data.
String is immutable (see here); there is no need to use new String(...) to store the result, split does that copy already.
If you can, delegate the whole data storage and searching to a database. CSV files are easily imported/exported to databases and they do all the hard work.
While I wouldn't recommend actual string interning for what you are doing, how about using the idea behind that technique? You could use a HashSet or HashMap to make sure you only use a single String instance whenever your data contains the same sequence of characters. I mean, there must be some kind of overlap in the data, right?
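A rough sketch of that deduplication idea, using a hypothetical StringPool helper:

import java.util.HashMap;
import java.util.Map;

public class StringPool {
    private final Map<String, String> pool = new HashMap<String, String>();

    /** Returns one shared String instance per distinct character sequence. */
    public String canonical(String s) {
        String existing = pool.get(s);
        if (existing == null) {
            pool.put(s, s);
            return s;
        }
        return existing;
    }
}

Each field produced by the split call would then be passed through canonical() before being added to tokenizedLines, so repeated values share a single String instance.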
On the other hand, what you might be seeing here could be a bad case of heap fragmentation. I'm not sure how the JVM handles these cases, but in the Microsoft CLR larger objects (especially arrays) will be allocated on a separate heap. Growth strategies, such as those of the ArrayList will create a larger array, then copy over the content of the previous array before releasing the reference to it. The Large Object Heap (LOH) isn't compacted in the CLR, so this growth strategy will leave huge areas of free memory that the ArrayList can no longer use.
I don't know how much of that applies to the Java VM, but you could try building the list using LinkedList first, then dump the list content into an ArrayList or directly into an array. That way the large array of lines would be created only once, without causing any fragmentation.
Be sure that the total length of both files is lower than your heap size. You can set the max heap size using the JVM option -Xmx.
Then, if you have that much content, maybe you shouldn't load it entirely into memory. One time I had a similar problem, and I fixed it using an index file that stores the offsets of the records in the large file; then I just had to read one line at the right offset.
Also, in your split method there are some strange things.
String remainder = new String(inputString);
You don't have to preserve inputString by making a copy; Strings are immutable, so changes only apply within the scope of the split method.
