Java: Memory-efficient ByteArrayOutputStream

I've got a 40MB file on disk and I need to "map" it into memory as a byte array.
At first I thought writing the file to a ByteArrayOutputStream would be the best way, but I find it takes about 160MB of heap space at some moment during the copy operation.
Does somebody know a better way to do this without using three times the file size of RAM?
Update: Thanks for your answers. I noticed I could reduce memory consumption a little by telling ByteArrayOutputStream to use an initial size a bit greater than the original file size (using the exact size with my code forces a reallocation; I still have to check why).
There's another high-memory spot: when I get the byte[] back with ByteArrayOutputStream.toByteArray. Looking at its source code, I can see it clones the array:
public synchronized byte[] toByteArray() {
    return Arrays.copyOf(buf, count);
}
I'm thinking I could just extend ByteArrayOutputStream and rewrite this method so that it returns the original array directly. Is there any potential danger here, given that the stream and the byte array won't be used more than once?

MappedByteBuffer might be what you're looking for.
I'm surprised it takes so much RAM to read a file in memory, though. Have you constructed the ByteArrayOutputStream with an appropriate capacity? If you haven't, the stream could allocate a new byte array when it's near the end of the 40 MB, meaning that you would, for example, have a full buffer of 39MB, and a new buffer of twice the size. Whereas if the stream has the appropriate capacity, there won't be any reallocation (faster), and no wasted memory.

ByteArrayOutputStream should be okay so long as you specify an appropriate size in the constructor. It will still create a copy when you call toByteArray, but that's only temporary. Do you really mind the memory briefly going up a lot?
Alternatively, if you already know the size to start with you can just create a byte array and repeatedly read from a FileInputStream into that buffer until you've got all the data.
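For example, a sketch of that read loop (the file name is illustrative):
File file = new File("somefile.bin"); // illustrative path
byte[] data = new byte[(int) file.length()];
try (FileInputStream fis = new FileInputStream(file)) {
    int pos = 0;
    while (pos < data.length) {
        int n = fis.read(data, pos, data.length - pos);
        if (n == -1) {
            break; // file ended sooner than expected
        }
        pos += n;
    }
}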

If you really want to map the file into memory, then a FileChannel is the appropriate mechanism.
If all you want to do is read the file into a simple byte[] (and don't need changes to that array to be reflected back to the file), then simply reading into an appropriately-sized byte[] from a normal FileInputStream should suffice.
Guava has Files.toByteArray() which does all that for you.
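For example (assuming Guava is on the classpath; the path is a placeholder):
byte[] data = com.google.common.io.Files.toByteArray(new File("somefile.bin"));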

For an explanation of the buffer growth behavior of ByteArrayOutputStream, please read this answer.
In answer to your question, it is safe to extend ByteArrayOutputStream. In your situation it is probably better to override the write methods so that the maximum additional allocation is limited, say, to 16MB. You should not override toByteArray to expose the protected buf[] member, because a stream is more than a buffer: it is a buffer plus a position pointer and boundary protection. It is therefore dangerous to access, and potentially manipulate, the buffer from outside the class.
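A minimal sketch of that idea, with a hypothetical class name and a 16MB growth cap (buf and count are protected in ByteArrayOutputStream, so a subclass may pre-grow the buffer before delegating to super.write):
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

public class BoundedGrowthOutputStream extends ByteArrayOutputStream {
    private static final int MAX_STEP = 16 * 1024 * 1024; // grow by at most 16MB per reallocation

    public BoundedGrowthOutputStream(int initialSize) { super(initialSize); }

    @Override
    public synchronized void write(byte[] b, int off, int len) {
        int needed = count + len;
        if (needed > buf.length) {
            // Double as usual, but cap the growth step at MAX_STEP, and always fit the data.
            int grown = buf.length + Math.min(buf.length, MAX_STEP);
            buf = Arrays.copyOf(buf, Math.max(needed, grown));
        }
        super.write(b, off, len); // the buffer is now big enough, so super will not reallocate
    }
    // Note: the single-byte write(int) still uses the default growth policy in this sketch.
}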

I'm thinking I could just extend ByteArrayOutputStream and rewrite this method so that it returns the original array directly. Is there any potential danger here, given that the stream and the byte array won't be used more than once?
You shouldn't change the specified behavior of the existing method, but it's perfectly fine to add a new method. Here's an implementation:
/** Subclasses ByteArrayOutputStream to give access to the internal raw buffer. */
public class ByteArrayOutputStream2 extends java.io.ByteArrayOutputStream {
    public ByteArrayOutputStream2() { super(); }
    public ByteArrayOutputStream2(int size) { super(size); }

    /** Returns the internal buffer of this ByteArrayOutputStream, without copying. */
    public synchronized byte[] buf() {
        return this.buf;
    }
}
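One thing to keep in mind when using it: the raw buffer may be longer than the data actually written, so pair it with size(). A small usage sketch (expectedSize is hypothetical):
ByteArrayOutputStream2 baos = new ByteArrayOutputStream2(expectedSize);
// ... copy the file into baos ...
byte[] raw = baos.buf();      // no copy, but raw.length may exceed the valid data
int validBytes = baos.size(); // only the first size() bytes are meaningful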
An alternative but hackish way of getting the buffer from any ByteArrayOutputStream is to use the fact that its writeTo(OutputStream) method passes the buffer directly to the provided OutputStream:
/**
 * Returns the internal raw buffer of a ByteArrayOutputStream, without copying.
 */
public static byte[] getBuffer(ByteArrayOutputStream bout) {
    final byte[][] result = new byte[1][];
    try {
        bout.writeTo(new OutputStream() {
            @Override
            public void write(byte[] buf, int offset, int length) {
                result[0] = buf;
            }

            @Override
            public void write(int b) {}
        });
    } catch (IOException e) {
        throw new RuntimeException(e);
    }
    return result[0];
}
(That works, but I'm not sure if it's useful, given that subclassing ByteArrayOutputStream is simpler.)
However, from the rest of your question it sounds like all you want is a plain byte[] of the complete contents of the file. As of Java 7, the simplest and fastest way to do that is to call Files.readAllBytes. In Java 6 and below, you can use DataInputStream.readFully, as in Peter Lawrey's answer. Either way, you will get an array that is allocated once at the correct size, without the repeated reallocation of ByteArrayOutputStream.
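For example (Java 7+; the path is a placeholder):
byte[] data = java.nio.file.Files.readAllBytes(java.nio.file.Paths.get("somefile.bin"));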

If you have 40 MB of data I don't see any reason why it would take more than 40 MB to create a byte[]. I assume you are using a growing ByteArrayOutputStream, which creates a byte[] copy when finished.
You can try the old read-the-file-in-one-go approach:
File file = new File("somefile"); // path is illustrative
DataInputStream is = new DataInputStream(new FileInputStream(file));
byte[] bytes = new byte[(int) file.length()];
is.readFully(bytes);
is.close();
Using a MappedByteBuffer is more efficient and avoids a copy of the data (and avoids using the heap much), provided you can use the ByteBuffer directly; however, if you have to end up with a byte[] it's unlikely to help much.

... but I find it takes about 160MB of heap space at some moment during the copy operation
I find this extremely surprising ... to the extent that I have my doubts that you are measuring the heap usage correctly.
Let's assume that your code is something like this:
BufferedInputStream bis = new BufferedInputStream(
new FileInputStream("somefile"));
ByteArrayOutputStream baos = new ByteArrayOutputStream(); /* no hint !! */
int b;
while ((b = bis.read()) != -1) {
    baos.write((byte) b);
}
byte[] stuff = baos.toByteArray();
Now the way that a ByteArrayOutputStream manages its buffer is to allocate an initial size, and (at least) double the buffer when it fills it up. Thus, in the worst case baos might use up to 80Mb buffer to hold a 40Mb file.
The final step allocates a new array of exactly baos.size() bytes to hold the buffer's contents. That's 40Mb. So the peak amount of memory that is actually in use should be 120Mb.
So where are those extra 40Mb being used? My guess is that they are not, and that you are actually reporting the total heap size, not the amount of memory that is occupied by reachable objects.
So what is the solution?
You could use a memory mapped buffer.
You could give a size hint when you allocate the ByteArrayOutputStream; e.g.
ByteArrayOutputStream baos = new ByteArrayOutputStream((int) file.length());
You could dispense with the ByteArrayOutputStream entirely and read directly into a byte array.
byte[] buffer = new byte[(int) file.length()];
FileInputStream fis = new FileInputStream(file);
int nosRead = fis.read(buffer);
/* check that nosRead == buffer.length and repeat if necessary */
Both options 1 and 2 should have a peak memory usage of 40Mb while reading a 40Mb file; i.e. no wasted space.
It would be helpful if you posted your code, and described your methodology for measuring memory usage.
I'm thinking I could just extend ByteArrayOutputStream and rewrite this method so that it returns the original array directly. Is there any potential danger here, given that the stream and the byte array won't be used more than once?
The potential danger is that your assumptions are incorrect, or become incorrect due to someone else modifying your code unwittingly ...

Google Guava's ByteSource seems to be a good choice for buffering in memory. Unlike implementations such as ByteArrayOutputStream or ByteArrayList (from the Colt library), it does not merge the data into one huge byte array but stores each chunk separately. An example:
List<ByteSource> result = new ArrayList<>();
try (InputStream source = httpRequest.getInputStream()) {
    byte[] cbuf = new byte[CHUNK_SIZE];
    while (true) {
        int read = source.read(cbuf);
        if (read == -1) {
            break;
        } else {
            result.add(ByteSource.wrap(Arrays.copyOf(cbuf, read)));
        }
    }
}
ByteSource body = ByteSource.concat(result);
The ByteSource can be read as an InputStream anytime later:
InputStream data = body.openBufferedStream();
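If it helps, the concatenated ByteSource can also report its total size or be streamed onward without ever building one big array (someOutputStream is just a placeholder):
long totalBytes = body.size();  // total length across all chunks
body.copyTo(someOutputStream);  // streams the chunks out one by one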

... came here with the same observation when reading a 1GB file: Oracle's ByteArrayOutputStream manages its memory lazily.
A byte array is indexed by an int and is therefore limited to 2GB anyway. Without depending on third-party libraries, you might find this useful:
static public byte[] getBinFileContent(String aFile)
{
    try
    {
        final int bufLen = 32768;
        final long fs = new File(aFile).length();
        final long maxInt = ((long) 1 << 31) - 1;
        if (fs > maxInt)
        {
            System.err.println("file size out of range");
            return null;
        }
        final byte[] res = new byte[(int) fs];
        final byte[] buffer = new byte[bufLen];
        final InputStream is = new FileInputStream(aFile);
        int n;
        int pos = 0;
        while ((n = is.read(buffer)) > 0)
        {
            System.arraycopy(buffer, 0, res, pos, n);
            pos += n;
        }
        is.close();
        return res;
    }
    catch (final IOException e)
    {
        e.printStackTrace();
        return null;
    }
    catch (final OutOfMemoryError e)
    {
        e.printStackTrace();
        return null;
    }
}

Related

How buffered streams work internally in Java

I'm reading about buffered streams. I searched and found many answers that cleared up my concepts, but I still have a few more questions.
After searching, I have come to understand that a buffer is temporary memory (RAM) which helps a program read data quickly instead of going to the hard disk each time, and that when the buffer is empty the native input API is called.
After reading a little more, I got this answer from here:
Reading data from disk byte-by-byte is very inefficient. One way to
speed it up is to use a buffer: instead of reading one byte at a time,
you read a few thousand bytes at once, and put them in a buffer, in
memory. Then you can look at the bytes in the buffer one by one.
I have two points of confusion.
1: How, and by whom, is the data filled into the buffer? (How does the native API do it?) As quoted above, who fills in the thousands of bytes at once? And won't it consume the same time? Suppose I have 5MB of data, and the 5MB is loaded into the buffer in 5 seconds, and then the program uses this data from the buffer in another 5 seconds: 10 seconds total. But if I skip buffering, the program gets the data directly from the hard disk at 1MB per 2 seconds, which is the same 10 seconds total. Please clear up this confusion.
2: Secondly, how does this line work?
BufferedReader inputStream = new BufferedReader(new FileReader("xanadu.txt"));
As I understand it, the FileReader writes data to a buffer, and then the BufferedReader reads data from the buffer memory? Please explain this as well.
Thanks.
As for the performance of using buffering during read/write, it's probably minimal in impact since the OS will cache too, however buffering will reduce the number of calls to the OS, which will have an impact.
When you add other operations on top, such as character encoding/decoding or compression/decompression, the impact is greater as those operations are more efficient when done in blocks.
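As a rough sketch of the difference (the file name is illustrative): unbuffered, every read() goes to the underlying stream and usually to the OS; buffered, the wrapper fetches a block (8KB by default) and serves most read() calls from memory:
// Unbuffered: one call into the stream (and typically the OS) per byte.
try (InputStream raw = new FileInputStream("big.dat")) {
    int b;
    while ((b = raw.read()) != -1) { /* ... */ }
}

// Buffered: the wrapper reads 8KB at a time and serves read() from its buffer.
try (InputStream buffered = new BufferedInputStream(new FileInputStream("big.dat"))) {
    int b;
    while ((b = buffered.read()) != -1) { /* ... */ }
}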
Your second question said:
As I'm thinking FileReader write data to buffer, then BufferedReader read data from buffer memory? Also explain this.
I believe your thinking is wrong. Yes, technically the FileReader will write data to a buffer, but the buffer is not defined by the FileReader, it's defined by the caller of the FileReader.read(buffer) method.
The operation is initiated from outside, when some code calls BufferedReader.read() (any of the overloads). The BufferedReader will then check its buffer, and if enough data is available there, it will return the data without involving the FileReader. If more data is needed, the BufferedReader will call the FileReader.read(buffer) method to get the next chunk of data.
It's a pull operation, not a push, meaning the data is pulled out of the readers by the caller.
All of this is done by a private method named fill(), which I give here for educational purposes, although any Java IDE will let you see the source code yourself:
private void fill() throws IOException {
    int dst;
    if (markedChar <= UNMARKED) {
        /* No mark */
        dst = 0;
    } else {
        /* Marked */
        int delta = nextChar - markedChar;
        if (delta >= readAheadLimit) {
            /* Gone past read-ahead limit: Invalidate mark */
            markedChar = INVALIDATED;
            readAheadLimit = 0;
            dst = 0;
        } else {
            if (readAheadLimit <= cb.length) {
                /* Shuffle in the current buffer */
                // here the already-read chars are shuffled within the in-memory buffer named cb
                System.arraycopy(cb, markedChar, cb, 0, delta);
                markedChar = 0;
                dst = delta;
            } else {
                /* Reallocate buffer to accommodate read-ahead limit */
                char ncb[] = new char[readAheadLimit];
                System.arraycopy(cb, markedChar, ncb, 0, delta);
                cb = ncb;
                markedChar = 0;
                dst = delta;
            }
            nextChar = nChars = delta;
        }
    }
    int n;
    do {
        n = in.read(cb, dst, cb.length - dst);
    } while (n == 0);
    if (n > 0) {
        nChars = dst + n;
        nextChar = dst;
    }
}

Java - download a file over the network with a buffer

I want to read from a network stream and write the bytes to a file directly.
But every time I run the program, very few bytes are actually written to the file.
Java:
InputStream in = uc.getInputStream();
int clength=uc.getContentLength();
byte[] barr = new byte[clength];
int offset=0;
int totalwritten=0;
int i;
int wrote=0;
OutputStream out = new FileOutputStream("file.xlsx");
while(in.available()!=0) {
    wrote=in.read(barr, offset, clength-offset);
    out.write(barr, offset, wrote);
    offset+=wrote;
    totalwritten+=wrote;
}
System.out.println("Written: "+totalwritten+" of "+clength);
out.flush();
That's because available() doesn't do what you think it does. Read its API documentation. You should simply read until the number of bytes read, returned by read(), is -1. Or even simpler, use Files.copy():
Files.copy(in, new File("file.xlsx").toPath());
Using a buffer that has the size of the input stream also pretty much defeats the purpose of using a buffer, which is to only have a few bytes in memory.
If you want to reimplement copy(), the general pattern is the following:
byte[] buffer = new byte[4096]; // number of bytes in memory
int numberOfBytesRead;
while ((numberOfBytesRead = in.read(buffer)) >= 0) {
    out.write(buffer, 0, numberOfBytesRead);
}
You're using .available() wrong. From Java documentation:
available() returns an estimate of the number of bytes that can be read
(or skipped over) from this input stream without blocking by the next
invocation of a method for this input stream
That means that the first time your stream is slower than your file-writing speed (which will happen very soon, in all probability), the while loop ends.
You should either prepare a thread that waits for the input until it has read all the expected content length (with a sizable timeout, of course) or just block your program in the wait, if user interaction is not a big deal.
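A minimal sketch of the blocking approach, reusing the variable names from the question:
byte[] barr = new byte[clength];
int offset = 0;
while (offset < clength) {
    int n = in.read(barr, offset, clength - offset); // blocks until some data arrives
    if (n == -1) {
        break; // the server closed the connection early
    }
    offset += n;
}
out.write(barr, 0, offset);
out.close();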

Why doesn't InputStream fill the array fully?

Dude, I'm using the following code to read a large file (2MB or more) and do some work with the data.
I have to read 128 bytes for each data-read call.
At first I used this code (no problem, works fine):
InputStream is; // = something...
int read = -1;
byte[] buff = new byte[128];
while (true) {
    for (int idx = 0; idx < 128; idx++) {
        read = is.read();
        if (read == -1) { return; } // end of stream
        buff[idx] = (byte) read;
    }
    process_data(buff);
}
Then I tried this code, with which problems appeared (errors and weird responses sometimes):
InputStream is; // = something...
int read = -1;
byte[] buff = new byte[128];
while (true) {
    // ERROR! Java doesn't read 128 bytes even though they're available
    if ((read = is.read(buff, 0, 128)) == 128) { process_data(buff); } else { return; }
}
The above code doesn't work all the time; I'm sure the data is available, but read comes back as 127, 125, or 123 sometimes. What is the problem?
I also found code for this that uses DataInputStream#readFully(buff:byte[]):void, which works too, but I'm just wondering why the second solution doesn't fill the array while the data is available.
Thanks buddy.
Consulting the javadoc for FileInputStream (I'm assuming that's what you're using, since you're reading from a file):
Reads up to len bytes of data from this input stream into an array of bytes. If len is not zero, the method blocks until some input is available; otherwise, no bytes are read and 0 is returned.
The key here is that the method only blocks until some data is available. The returned value tells you how many bytes were actually read. The reason you may be reading fewer than 128 bytes could be a slow drive or implementation-defined behavior.
For a proper read sequence, you should check that read() does not return -1 (end of stream) and keep writing into the buffer until the correct amount of data has been read.
Example of a proper implementation of your code:
InputStream is; // = something...
int read;
int read_total;
byte[] buf = new byte[128];
// Outer loop: one iteration per 128-byte block
while (true) {
    read_total = 0;
    // Repeatedly read until the block is full or end of stream, offsetting at the last read position
    while ((read = is.read(buf, read_total, buf.length - read_total)) != -1) {
        // Add the amount read to the running total
        read_total = read_total + read;
        // Break once read_total has reached the buffer length (128)
        if (read_total == buf.length) {
            break;
        }
    }
    if (read_total != buf.length) {
        // Incomplete read before 128 bytes: end of stream reached
        break;
    } else {
        process_data(buf);
    }
}
Edit:
Don't try to use available() as an indicator of data availability (sounds weird I know), again the javadoc:
Returns an estimate of the number of remaining bytes that can be read (or skipped over) from this input stream without blocking by the next invocation of a method for this input stream. Returns 0 when the file position is beyond EOF. The next invocation might be the same thread or another thread. A single read or skip of this many bytes will not block, but may read or skip fewer bytes.
In some cases, a non-blocking read (or skip) may appear to be blocked when it is merely slow, for example when reading large files over slow networks.
The key there is estimate, don't work with estimates.
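On Java 8 and earlier, the DataInputStream.readFully approach mentioned in the question is the usual way to fill the whole array before processing; a small sketch (wrapping the existing stream):
DataInputStream dis = new DataInputStream(is);
byte[] buff = new byte[128];
try {
    while (true) {
        dis.readFully(buff); // blocks until all 128 bytes have been read
        process_data(buff);
    }
} catch (EOFException e) {
    // end of stream; the final block may have been partial, handle it if needed
}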
Since the accepted answer was provided, a new option has become available. Starting with Java 9, the InputStream class has a readNBytes method (and Java 11 added a second overload, used in the simpler example below) that eliminates the need for the programmer to write a read loop; for example, your method could look like:
public static void some_method(String[] args) throws IOException {
    InputStream is = new FileInputStream(args[1]);
    byte[] buff = new byte[128];
    while (true) {
        int numRead = is.readNBytes(buff, 0, buff.length);
        if (numRead == 0) {
            break;
        }
        // The last read before end-of-stream may read fewer than 128 bytes.
        process_data(buff, numRead);
    }
}
or the slightly simpler
public static void some_method(String[] args) throws IOException {
    InputStream is = new FileInputStream(args[1]);
    while (true) {
        byte[] buff = is.readNBytes(128);
        if (buff.length == 0) {
            break;
        }
        // The last read before end-of-stream may read fewer than 128 bytes.
        process_data(buff);
    }
}

Java: StringBuffer to byte[] without toString

The title says it all. Is there any way to convert from a StringBuilder to a byte[] without using a String in the middle?
The problem is that I'm managing REALLY large strings (millions of chars), and I have a loop that adds a char at the end and obtains the byte[]. The process of converting the StringBuffer to a String makes this loop very, very slow.
Is there any way to accomplish this? Thanks in advance!
As many have already suggested, you can use the CharBuffer class, but allocating a new CharBuffer would only make your problem worse.
Instead, you can directly wrap your StringBuilder in a CharBuffer, since StringBuilder implements CharSequence:
Charset charset = StandardCharsets.UTF_8;
CharsetEncoder encoder = charset.newEncoder();
// No allocation performed, just wraps the StringBuilder.
CharBuffer buffer = CharBuffer.wrap(stringBuilder);
ByteBuffer bytes = encoder.encode(buffer);
EDIT: Duarte correctly points out that the CharsetEncoder.encode method may return a buffer whose backing array is larger than the actual data—meaning, its capacity is larger than its limit. It is necessary either to read from the ByteBuffer itself, or to read a byte array out of the ByteBuffer that is guaranteed to be the right size. In the latter case, there's no avoiding having two copies of the bytes in memory, albeit briefly:
ByteBuffer byteBuffer = encoder.encode(buffer);
byte[] array;
int arrayLen = byteBuffer.limit();
if (arrayLen == byteBuffer.capacity()) {
    array = byteBuffer.array();
} else {
    // This will place two copies of the byte sequence in memory,
    // until byteBuffer gets garbage-collected (which should happen
    // pretty quickly once the reference to it is null'd).
    array = new byte[arrayLen];
    byteBuffer.get(array);
}
byteBuffer = null;
If you're willing to replace the StringBuilder with something else, yet another possibility would be a Writer backed by a ByteArrayOutputStream:
ByteArrayOutputStream bout = new ByteArrayOutputStream();
Writer writer = new OutputStreamWriter(bout);
try {
    writer.write("String A");
    writer.write("String B");
    writer.flush(); // push the encoded chars through to the underlying stream
} catch (IOException e) {
    e.printStackTrace();
}
System.out.println(bout.toByteArray()); // prints the array reference; inspect the contents with Arrays.toString(...)
try {
    writer.write("String C");
    writer.flush();
} catch (IOException e) {
    e.printStackTrace();
}
System.out.println(bout.toByteArray());
As always, your mileage may vary.
For starters, you should probably be using StringBuilder, since StringBuffer has synchronization overhead that's usually unnecessary.
Unfortunately, there's no way to go directly to bytes, but you can copy the chars into an array or iterate from 0 to length() and read each charAt().
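A rough sketch of that, assuming sb is your StringBuilder and UTF-8 is the target charset:
char[] chars = new char[sb.length()];
sb.getChars(0, sb.length(), chars, 0); // copies the chars without creating a String
ByteBuffer encoded = StandardCharsets.UTF_8.encode(CharBuffer.wrap(chars));
// the result is encoded.remaining() bytes starting at encoded.position()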
Unfortunately, the answers above that deal with ByteBuffer's array() method are a bit buggy... The trouble is that the allocated byte[] is likely to be bigger than what you'd expect. Thus, there will be trailing NULL bytes that are hard to get rid of, since you can't "re-size" arrays in Java.
Here is an article that explains this in more detail:
http://worldmodscode.wordpress.com/2012/12/14/the-java-bytebuffer-a-crash-course/
What are you trying to accomplish with "millions of chars"? Are these logs that need to be parsed? Can you read it as just bytes and stick to a ByteBuffer? Then you can do:
buffer.array()
to get a byte[]
Depending on what it is you are doing, you can also use just a char[] or a CharBuffer:
CharBuffer cb = CharBuffer.allocate(4242);
cb.put("Depends on what it is you need to do");
...
Then you can get a char[] as:
cb.array()
It's always good to REPL things out; it's fun and proves the point. A Java REPL is not something we are accustomed to, but hey, there is Clojure to save the day, and it speaks Java fluently:
user=> (import java.nio.CharBuffer)
java.nio.CharBuffer
user=> (def cb (CharBuffer/allocate 4242))
#'user/cb
user=> (-> (.put cb "There Be") (.array))
#<char[] [C@206564e9>
user=> (-> (.put cb " Dragons") (.array) (String.))
"There Be Dragons"
If you want performance, I wouldn't use a StringBuilder or create a byte[] at all. Instead you can write progressively to the stream that will take the data in the first place. If you can't do that, you can copy the data from the StringBuilder to the Writer, but it's much faster not to create the StringBuilder in the first place.

How to write a big-endian ByteBuffer as little-endian in Java

I currently have a Java ByteBuffer that already has the data in big-endian format. I then want to write it to a binary file as little-endian.
Here's the code, which still just writes the file in big-endian:
public void writeBinFile(String fileName, boolean append) throws FileNotFoundException, IOException
{
    FileOutputStream outStream = null;
    try
    {
        outStream = new FileOutputStream(fileName, append);
        FileChannel out = outStream.getChannel();
        byteBuff.position(byteBuff.capacity());
        byteBuff.flip();
        byteBuff.order(ByteOrder.LITTLE_ENDIAN);
        out.write(byteBuff);
    }
    finally
    {
        if (outStream != null)
        {
            outStream.close();
        }
    }
}
Note that byteBuff is a ByteBuffer that has been filled in Big Endian format.
My last resort is a brute force method of creating another buffer and setting that ByteBuffer to little endian and then reading the "getInt" values from the original (big endian) buffer, and "setInt" the value to the little endian buffer. I'd imagine there is a better way...
Endianness has no meaning for a byte[]. Endianness only matters for multi-byte data types like short, int, long, float, or double. The right time to get the endianness right is when you are writing the raw data out from the original types and reading it back into the actual format.
If you have been given a byte[], you must decode the original data types and re-encode them with the different endianness. I am sure you will agree that this is (a) not easy to do or ideal and (b) cannot be done automagically.
Here is how I solved a similar problem, wanting to get the "endianness" of the Integers I'm writing to an output file correct:
byte[] theBytes = /* obtain a byte array that is the input */
ByteBuffer byteBuffer = ByteBuffer.wrap(theBytes);
ByteBuffer destByteBuffer = ByteBuffer.allocate(theBytes.length);
destByteBuffer.order(ByteOrder.LITTLE_ENDIAN);
IntBuffer destBuffer = destByteBuffer.asIntBuffer();
while (byteBuffer.hasRemaining())
{
    int element = byteBuffer.getInt();
    destBuffer.put(element);
    /* Could write destBuffer int-by-int here, or outside this loop */
}
There might be more efficient ways to do this, but for my particular problem, I had to apply a mathematical transformation to the elements as I copied them to the new buffer. But this should still work for your particular problem.
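One possibly more efficient variant, sketched under the assumption that the data is a whole number of ints, is to view both buffers as IntBuffers and let a bulk put re-encode every value with the destination's byte order:
ByteBuffer src = ByteBuffer.wrap(theBytes);   // big-endian by default
ByteBuffer dst = ByteBuffer.allocate(theBytes.length).order(ByteOrder.LITTLE_ENDIAN);
dst.asIntBuffer().put(src.asIntBuffer());     // every int is rewritten in little-endian order
// dst's own position is still 0, so it can be written straight to a FileChannel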
