Java heap space error while reading file in byte array - java

I am getting a Java out-of-heap-space error while using the following code. Can someone tell me what I am doing wrong here?
On debugging I see that the value of length is 709582875.
In the main function:
File file = new File(fileLocation + fileName);
if (file.exists()) {
    s3Client.upload(bucketName, fileName, getBytesFromFile(file));
}
// Returns the contents of the file in a byte array.
public static byte[] getBytesFromFile(File file) throws IOException {
    InputStream is = new FileInputStream(file);
    // Get the size of the file
    long length = file.length();
    // You cannot create an array using a long type.
    // It needs to be an int type.
    // Before converting to an int type, check
    // to ensure that file is not larger than Integer.MAX_VALUE.
    if (length > Integer.MAX_VALUE) {
        // File is too large
        log.debug("file is too large" + length);
        System.out.println("file is too large" + length);
    }
    if (length < Integer.MIN_VALUE || length > Integer.MAX_VALUE) {
        throw new IOException(length + " cannot be cast to int without changing its value.");
    }
    // return "test".getBytes();
    // Create the byte array to hold the data
    byte[] bytes = null;
    try {
        bytes = new byte[(int) length];
    } catch (OutOfMemoryError e) {
        System.out.println(e.getStackTrace().toString());
    }
    // Read in the bytes
    int offset = 0;
    int numRead = 0;
    while (offset < bytes.length
           && (numRead = is.read(bytes, offset, bytes.length - offset)) >= 0) {
        offset += numRead;
    }
    // Ensure all the bytes have been read in
    if (offset < bytes.length) {
        throw new IOException("Could not completely read file " + file.getName());
    }
    // Close the input stream and return bytes
    is.close();
    return bytes;
}

The problem is that the byte array you are allocating is too large, and it uses up the heap space.
You can try running your program with the -Xms and -Xmx options to specify the minimum and maximum heap space the Java virtual machine uses to run your program.
But I suggest you not read the whole file into a byte array to process it. You can read part of it into a small byte array, process that portion, and continue to the next part, as in the sketch below. This approach uses less heap space.
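For example, a minimal sketch of chunked processing (processChunk is a hypothetical placeholder for whatever you do with each portion):
byte[] buffer = new byte[8192]; // small, fixed-size chunk
try (InputStream in = new FileInputStream(file)) {
    int n;
    while ((n = in.read(buffer)) > 0) {
        processChunk(buffer, n); // hypothetical: handle only this portion of the file
    }
}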

You are consuming 709582875 bytes (about 677MB) at the moment the byte array in the try block is allocated. This is quite large by conventional personal computing standards, and would consume most (if not all) of the memory of a JVM started with default settings.
Some information on default JVM memory settings can be found in the JVM documentation.

Try to increase the heap size allocated by the Java Virtual Machine (JVM),
something like:
java -Xms<initial heap size> -Xmx<maximum heap size>
For example:
java -Xms64m -Xmx256m HelloWorld

Do not create such a huge byte[] array. Your heap may run out of memory. It is a bad idea to create a byte[] array of the full file length for such a large file. Create a small byte array and read the file chunk by chunk.

You need some JVM tuning:
java -Xms256m -Xmx1024m

Is there a particular reason you need to read the whole file at once as a byte[]? Can you use a memory-mapped ByteBuffer instead, as this uses very little heap regardless of the size of the file?
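A minimal sketch of the memory-mapped approach, assuming read-only access (uses java.nio.channels.FileChannel, java.nio.MappedByteBuffer and java.nio.file.StandardOpenOption):
try (FileChannel channel = FileChannel.open(file.toPath(), StandardOpenOption.READ)) {
    MappedByteBuffer buffer = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
    // read with buffer.get(...); the file is paged in by the OS, not held on the Java heap
    // note: a single mapping is limited to Integer.MAX_VALUE bytes (about 2GB)
}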

Related

What is the better way to substring large text?

Suppose my file is 2GB and I want some specific data from one index to another index (say the specific data between the two indexes is about 300MB). What is the better way to do that? I tried substring, but it throws an out-of-memory exception. Please suggest a better way to do the same.
In general, assuming that the 2GB file is on disk and you want to read some part of it into memory, you absolutely don't have to read the whole 2GB into memory first.
The most straightforward solution is using RandomAccessFile.
The point is that it provides an abstraction of a pointer that can be moved back and forth over a big file, and once it is positioned you can read bytes from the place the pointer points to.
RandomAccessFile file = new RandomAccessFile(path, "r");
file.seek(position);
byte[] bytes = new byte[size];
file.readFully(bytes); // read(bytes) may return fewer bytes than requested; readFully fills the whole array
file.close();
Reading the file character by character and writing the characters to an output file can also solve the issue, since it won't load the whole file at once.
So the process will be: read the input file by character, skip ahead to the desired substring start index, then write to an output file until the end of the substring.
If you are getting Exception in thread "main" java.lang.OutOfMemoryError: Java heap space, you can try increasing the heap size if you really need to read the file at once and you are sure the resulting String won't exceed the maximum String length.
The following snippet shows the idea above -
import java.io.*;

public class LargeFileSubstr {
    public static void main(String[] args) throws IOException {
        try (BufferedReader r = new BufferedReader(new FileReader("/Users/me/Downloads/big.txt"));
             PrintWriter wr = new PrintWriter(new FileWriter("/Users/me/Downloads/big_substr.txt"))) {
            int startIndex = 100;
            int endIndex = 200;
            int pointer = 0;
            int ch;
            while ((ch = r.read()) != -1) {
                if (pointer > endIndex) {
                    break;
                }
                if (pointer >= startIndex) {
                    wr.print((char) ch);
                }
                pointer++;
            }
        }
    }
}
I have tried this to take a 200MB substring out of a 2GB file; it works reasonably fast.

Java TCP-Sockets transmit files larger than 4gb [duplicate]

I am trying to transfer a file that is greater than 4GB using the Java Sockets API. I am already reading it via InputStreams and writing it via OutputStreams. However, analyzing the transmitted packets in Wireshark, I realise that the sequence number of the TCP packets is incremented by the byte length of the packet, which seems to be 1440 bytes.
This leads to the behavior that when I try to send a file greater than 4GB, the total range of the TCP sequence-number field is exceeded, leading to lots of error packets, but no error in Java.
My code for transmission currently looks like this:
DataOutputStream fileTransmissionStream = new DataOutputStream(transmissionSocket.getOutputStream());
FileInputStream fis = new FileInputStream(toBeSent);
int totalFileSize = fis.available();
fileTransmissionStream.writeInt(totalFileSize);
while (totalFileSize >0){
if(totalFileSize >= FileTransmissionManagementService.splittedTransmissionSize){
sendBytes = new byte[FileTransmissionManagementService.splittedTransmissionSize];
fis.read(sendBytes);
totalFileSize -= FileTransmissionManagementService.splittedTransmissionSize;
} else {
sendBytes = new byte[totalFileSize];
fis.read(sendBytes);
totalFileSize = 0;
}
byte[] encryptedBytes = DataEncryptor.encrypt(sendBytes);
/*byte[] bytesx = ByteBuffer.allocate(4).putInt(encryptedBytes.length).array();
fileTransmissionStream.write(bytesx,0,4);*/
fileTransmissionStream.writeInt(encryptedBytes.length);
fileTransmissionStream.write(encryptedBytes, 0, encryptedBytes.length);
What exactly have I done wrong in this situation, or is it not possible to transmit files greater than 4gb via one Socket?
TCP can handle infinitely long data streams. There is no problem with the sequence number wrapping around. As it is initially random, that can happen almost immediately, regardless of the length of the stream. The problems are in your code:
DataOutputStream fileTransmissionStream = new DataOutputStream(transmissionSocket.getOutputStream());
FileInputStream fis = new FileInputStream(toBeSent);
int totalFileSize = fis.available();
Classic misuse of available(). Have a look at the Javadoc and see what it's really for. This is also where your basic problem lies, as values > 2G don't fit into an int, so there is a truncation. You should be using File.length(), and storing it into a long.
fileTransmissionStream.writeInt(totalFileSize);
while (totalFileSize >0){
if(totalFileSize >= FileTransmissionManagementService.splittedTransmissionSize){
sendBytes = new byte[FileTransmissionManagementService.splittedTransmissionSize];
fis.read(sendBytes);
Here you are ignoring the result of read(). It isn't guaranteed to fill the buffer: that's why it returns a value. See, again, the Javadoc.
totalFileSize -= FileTransmissionManagementService.splittedTransmissionSize;
} else {
sendBytes = new byte[totalFileSize];
Here you are assuming the file size fits into an int, and assuming the bytes fit into memory.
fis.read(sendBytes);
See above re read().
totalFileSize = 0;
}
byte[] encryptedBytes = DataEncryptor.encrypt(sendBytes);
/*byte[] bytesx = ByteBuffer.allocate(4).putInt(encryptedBytes.length).array();
fileTransmissionStream.write(bytesx,0,4);*/
We're not interested in your commented-out code.
fileTransmissionStream.writeInt(encryptedBytes.length);
fileTransmissionStream.write(encryptedBytes, 0, encryptedBytes.length);
You don't need all this crud. Use a CipherOutputStream to take care of the encryption, or better still SSL, and use the following copy loop:
byte[] buffer = new byte[8192]; // or much more if you like, but there are diminishing returns
int count;
while ((count = in.read(buffer)) > 0)
{
out.write(buffer, 0, count);
}
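For the encryption side, a rough sketch of the CipherOutputStream idea (the Cipher initialization is assumed to happen elsewhere and is not shown):
// cipher is an already-initialized javax.crypto.Cipher in ENCRYPT_MODE (assumption)
OutputStream out = new CipherOutputStream(transmissionSocket.getOutputStream(), cipher);
// then run the copy loop above against this stream; close it when done so the final block is flushed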
It seems that your protocol for the transmission is:
Send total file length in an int.
For each bunch of bytes read,
Send number of encrypted bytes ahead in an int,
Send the encrypted bytes themselves.
The basic problem, beyond the misinterpretations of the documentation that were pointed out in @EJP's answer, is with this very protocol.
You assume that the file length can be sent over in an int. This means the length it sends cannot be more than Integer.MAX_VALUE. Of course, this limits you to files of 2GB length (remember Java integers are signed).
If you take a look at the Files.size() method, which is a method for getting the actual file size in bytes, you'll see that it returns long. A long will accommodate files larger than 2GB, and larger than 4GB. So in fact, your protocol should at the very least be defined to start with a long rather than an int field.
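A minimal sketch of that change, assuming both ends of the protocol are under your control (names follow the question's code):
// sender
DataOutputStream out = new DataOutputStream(transmissionSocket.getOutputStream());
out.writeLong(toBeSent.length()); // File.length() returns a long, so files > 4GB are representable

// receiver (assuming a matching DataInputStream named in)
long remaining = in.readLong();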
The size problem really has nothing at all to do with the TCP packets.

Java - download a file through network with a buffer

I want to read from a network stream and write the bytes directly to a file.
But every time I run the program, only very few bytes are actually written to the file.
Java:
InputStream in = uc.getInputStream();
int clength=uc.getContentLength();
byte[] barr = new byte[clength];
int offset=0;
int totalwritten=0;
int i;
int wrote=0;
OutputStream out = new FileOutputStream("file.xlsx");
while(in.available()!=0) {
wrote=in.read(barr, offset, clength-offset);
out.write(barr, offset, wrote);
offset+=wrote;
totalwritten+=wrote;
}
System.out.println("Written: "+totalwritten+" of "+clength);
out.flush();
That's because available() doesn't do what you think it does. Read its API documentation. You should simply read until the number of bytes read, returned by read(), is -1. Or even simpler, use Files.copy():
Files.copy(in, new File("file.xlsx").toPath());
Using a buffer that has the size of the input stream also pretty much defeats the purpose of using a buffer, which is to only have a few bytes in memory.
If you want to reimplement copy(), the general pattern is the following:
byte[] buffer = new byte[4096]; // number of bytes in memory
int numberOfBytesRead;
while ((numberOfBytesRead = in.read(buffer)) >= 0) {
out.write(buffer, 0, numberOfBytesRead);
}
You're using .available() wrong. From Java documentation:
available() returns an estimate of the number of bytes that can be read
(or skipped over) from this input stream without blocking by the next
invocation of a method for this input stream
That means that the first time your stream is slower than your file-writing speed (which will happen very soon in all probability), the while loop ends.
You should either use a thread that keeps reading until it has received the full expected content length (with a sizable timeout, of course), or simply block the program while waiting, if user interaction is not a big deal.
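A minimal sketch of blocking until the expected content length has been read, instead of relying on available() (variable names follow the question's code):
byte[] barr = new byte[clength];
int offset = 0;
while (offset < clength) {
    int n = in.read(barr, offset, clength - offset); // blocks until at least one byte arrives
    if (n == -1) {
        break; // the server closed the connection early
    }
    offset += n;
}
out.write(barr, 0, offset);
System.out.println("Written: " + offset + " of " + clength);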

java.lang.OutOfMemoryError with Java Vector.addElement(Object o) method

Although I know the use of Java Vector is discouraged, I am stuck with legacy code that I don't have the luxury of modifying.
I am getting an OutOfMemoryError while trying to addElement to the Vector. My code snippet follows below. Please let me know if I can improve this code.
/*objOut is the Vector Object.
idx is incoming integer argument.
Val is some Object
*/
int sz = objOut.size();
if (idx == sz) {
objOut.addElement(val);
} else if (idx > sz) {
for (int i = (idx-sz); i>0; i--) {
objOut.addElement(null); // Code throws OutOfMemoryError at this line
}
objOut.addElement(val);
} else {
objOut.setElementAt(val, idx);
}
In your program, you are trying to allocate a large number of objects.
Your OS allocates some space to your JVM to work with and that space is called heap space. You get OutOfMemoryError when all of your heap space is filled and no more space is left to allocate for new objects.
So what you should do is increase your heap space with -Xmx like this:
java -Xmx1024m YourClassName
This will allow a maximum heap of 1024 MB (1 GB) for your program. You can adjust the value to your requirements.
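If you want to verify how much heap the JVM actually received, a quick check from inside the program:
long maxHeap = Runtime.getRuntime().maxMemory(); // maximum heap the JVM will attempt to use, in bytes
System.out.println("Max heap: " + (maxHeap / (1024 * 1024)) + " MB");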

Java: Memory efficient ByteArrayOutputStream

I've got a 40MB file on disk and I need to "map" it into memory using a byte array.
At first, I thought writing the file to a ByteArrayOutputStream would be the best way, but I find it takes about 160MB of heap space at some moment during the copy operation.
Does somebody know a better way to do this without using three times the file size of RAM?
Update: Thanks for your answers. I noticed I could reduce memory consumption a little by telling ByteArrayOutputStream to use an initial size a bit greater than the original file size (using the exact size with my code forces a reallocation; I've got to check why).
There's another high memory spot: when I get byte[] back with ByteArrayOutputStream.toByteArray. Taking a look to its source code, I can see it is cloning the array:
public synchronized byte toByteArray()[] {
return Arrays.copyOf(buf, count);
}
I'm thinking I could just extend ByteArrayOutputStream and rewrite this method, so to return the original array directly. Is there any potential danger here, given the stream and the byte array won't be used more than once?
MappedByteBuffer might be what you're looking for.
I'm surprised it takes so much RAM to read a file in memory, though. Have you constructed the ByteArrayOutputStream with an appropriate capacity? If you haven't, the stream could allocate a new byte array when it's near the end of the 40 MB, meaning that you would, for example, have a full buffer of 39MB, and a new buffer of twice the size. Whereas if the stream has the appropriate capacity, there won't be any reallocation (faster), and no wasted memory.
ByteArrayOutputStream should be okay so long as you specify an appropriate size in the constructor. It will still create a copy when you call toByteArray, but that's only temporary. Do you really mind the memory briefly going up a lot?
Alternatively, if you already know the size to start with you can just create a byte array and repeatedly read from a FileInputStream into that buffer until you've got all the data.
If you really want to map the file into memory, then a FileChannel is the appropriate mechanism.
If all you want to do is read the file into a simple byte[] (and don't need changes to that array to be reflected back to the file), then simply reading into an appropriately-sized byte[] from a normal FileInputStream should suffice.
Guava has Files.toByteArray() which does all that for you.
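For example (the path is just a placeholder):
byte[] data = com.google.common.io.Files.toByteArray(new File("/path/to/file"));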
For an explanation of the buffer growth behavior of ByteArrayOutputStream, please read this answer.
In answer to your question, it is safe to extend ByteArrayOutputStream. In your situation, it is probably better to override the write methods so that the maximum additional allocation is limited, say, to 16MB. You should not override toByteArray to expose the protected buf[] member, because a stream is not just a buffer; a stream is a buffer with a position pointer and boundary protection. So, it is dangerous to access and potentially manipulate the buffer from outside the class.
I'm thinking I could just extend ByteArrayOutputStream and rewrite this method, so to return the original array directly. Is there any potential danger here, given the stream and the byte array won't be used more than once?
You shouldn't change the specified behavior of the existing method, but it's perfectly fine to add a new method. Here's an implementation:
/** Subclasses ByteArrayOutputStream to give access to the internal raw buffer. */
public class ByteArrayOutputStream2 extends java.io.ByteArrayOutputStream {
    public ByteArrayOutputStream2() { super(); }
    public ByteArrayOutputStream2(int size) { super(size); }

    /** Returns the internal buffer of this ByteArrayOutputStream, without copying. */
    public synchronized byte[] buf() {
        return this.buf;
    }
}
An alternative but hackish way of getting the buffer from any ByteArrayOutputStream is to use the fact that its writeTo(OutputStream) method passes the buffer directly to the provided OutputStream:
/**
* Returns the internal raw buffer of a ByteArrayOutputStream, without copying.
*/
public static byte[] getBuffer(ByteArrayOutputStream bout) {
    final byte[][] result = new byte[1][];
    try {
        bout.writeTo(new OutputStream() {
            @Override
            public void write(byte[] buf, int offset, int length) {
                result[0] = buf;
            }
            @Override
            public void write(int b) {}
        });
    } catch (IOException e) {
        throw new RuntimeException(e);
    }
    return result[0];
}
(That works, but I'm not sure if it's useful, given that subclassing ByteArrayOutputStream is simpler.)
However, from the rest of your question it sounds like all you want is a plain byte[] of the complete contents of the file. As of Java 7, the simplest and fastest way to do that is call Files.readAllBytes. In Java 6 and below, you can use DataInputStream.readFully, as in Peter Lawrey's answer. Either way, you will get an array that is allocated once at the correct size, without the repeated reallocation of ByteArrayOutputStream.
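For example, the Java 7 call (the path is just a placeholder):
byte[] data = java.nio.file.Files.readAllBytes(java.nio.file.Paths.get("/path/to/file"));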
If you have 40 MB of data I don't see any reason why it would take more than 40 MB to create a byte[]. I assume you are using a growing ByteArrayOutputStream which creates a byte[] copy when finished.
You can try the old read-the-file-at-once approach.
File file = ...; // the file to read
DataInputStream is = new DataInputStream(new FileInputStream(file));
byte[] bytes = new byte[(int) file.length()];
is.readFully(bytes);
is.close();
Using a MappedByteBuffer is more efficient and avoids a copy of data (or using the heap much) provided you can use the ByteBuffer directly; however, if you have to use a byte[] it's unlikely to help much.
... but I find it takes about 160MB of heap space at some moment during the copy operation
I find this extremely surprising ... to the extent that I have my doubts that you are measuring the heap usage correctly.
Let's assume that your code is something like this:
BufferedInputStream bis = new BufferedInputStream(new FileInputStream("somefile"));
ByteArrayOutputStream baos = new ByteArrayOutputStream(); /* no hint !! */
int b;
while ((b = bis.read()) != -1) {
    baos.write((byte) b);
}
byte[] stuff = baos.toByteArray();
Now the way that a ByteArrayOutputStream manages its buffer is to allocate an initial size, and (at least) double the buffer when it fills it up. Thus, in the worst case baos might use up to 80Mb buffer to hold a 40Mb file.
The final step allocates a new array of exactly baos.size() bytes to hold the buffer's contents. That's 40Mb. So the peak amount of memory that is actually in use should be 120Mb.
So where are those extra 40Mb being used? My guess is that they are not, and that you are actually reporting the total heap size, not the amount of memory that is occupied by reachable objects.
So what is the solution?
You could use a memory mapped buffer.
You could give a size hint when you allocate the ByteArrayOutputStream; e.g.
ByteArrayOutputStream baos = new ByteArrayOutputStream((int) file.length());
You could dispense with the ByteArrayOutputStream entirely and read directly into a byte array.
byte[] buffer = new byte[(int) file.length()];
FileInputStream fis = new FileInputStream(file);
int nosRead = fis.read(buffer);
/* check that nosRead == buffer.length and repeat if necessary */
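Spelled out, that repeat-if-necessary loop looks roughly like this:
byte[] buffer = new byte[(int) file.length()];
try (FileInputStream fis = new FileInputStream(file)) {
    int pos = 0;
    while (pos < buffer.length) {
        int n = fis.read(buffer, pos, buffer.length - pos);
        if (n == -1) {
            throw new EOFException("file was truncated while reading");
        }
        pos += n;
    }
}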
Both options 1 and 2 should have a peak memory usage of 40Mb while reading a 40Mb file; i.e. no wasted space.
It would be helpful if you posted your code, and described your methodology for measuring memory usage.
I'm thinking I could just extend ByteArrayOutputStream and rewrite this method, so to return the original array directly. Is there any potential danger here, given the stream and the byte array won't be used more than once?
The potential danger is that your assumptions are incorrect, or become incorrect due to someone else modifying your code unwittingly ...
Google Guava ByteSource seems to be a good choice for buffering in memory. Unlike implementations like ByteArrayOutputStream or ByteArrayList (from the Colt library), it does not merge the data into a huge byte array but stores every chunk separately. An example:
List<ByteSource> result = new ArrayList<>();
try (InputStream source = httpRequest.getInputStream()) {
    byte[] cbuf = new byte[CHUNK_SIZE];
    while (true) {
        int read = source.read(cbuf);
        if (read == -1) {
            break;
        } else {
            result.add(ByteSource.wrap(Arrays.copyOf(cbuf, read)));
        }
    }
}
ByteSource body = ByteSource.concat(result);
The ByteSource can be read as an InputStream anytime later:
InputStream data = body.openBufferedStream();
... came here with the same observation when reading a 1GB file: Oracle's ByteArrayOutputStream has lazy memory management.
A byte array is indexed by an int and is therefore limited to 2GB anyway. Without any third-party dependency, you might find this useful:
static public byte[] getBinFileContent(String aFile)
{
    try
    {
        final int bufLen = 32768;
        final long fs = new File(aFile).length();
        final long maxInt = ((long) 1 << 31) - 1;
        if (fs > maxInt)
        {
            System.err.println("file size out of range");
            return null;
        }
        final byte[] res = new byte[(int) fs];
        final byte[] buffer = new byte[bufLen];
        final InputStream is = new FileInputStream(aFile);
        int n;
        int pos = 0;
        while ((n = is.read(buffer)) > 0)
        {
            System.arraycopy(buffer, 0, res, pos, n);
            pos += n;
        }
        is.close();
        return res;
    }
    catch (final IOException e)
    {
        e.printStackTrace();
        return null;
    }
    catch (final OutOfMemoryError e)
    {
        e.printStackTrace();
        return null;
    }
}
