What is the buffer in ByteArrayOutputStream(int size) exactly?

I understand what a buffer is when writing to a file: OS-level file writing (calling a native API, one method call per char) is costly, so many chars/bytes are collected in a buffer and the buffer is written to the file with one OS API call.
But what buffer is meant here? And why?
ByteArrayOutputStream(int size) - Creates a new byte array output stream, with a buffer capacity of the specified size, in bytes.
ByteArrayOutputStream() has a 32-byte buffer by default, which is why Apache Commons has a class of exactly the same name, org.apache.commons.io.output.ByteArrayOutputStream, that differs only in buffer size and growth mechanism: "The original implementation only allocates 32 bytes at the beginning. As this class is designed for heavy duty it starts at 1024 bytes. In contrast to the original it doesn't reallocate the whole memory block but allocates additional buffers. This way no buffers need to be garbage collected and the contents don't have to be copied to the new buffer. This class is designed to behave exactly like the original."
Besides, in ByteArrayInputStream(byte[] buf), as I understand it, "buf" (the buffer) is actually the source of data (bytes) to be fed into the InputStream (ByteArrayInputStream emulates an InputStream over a byte array), so the word "buffer" there is confusing in my opinion.

This class implements an output stream in which the data is written into a byte array. The buffer automatically grows as data is written to it.
The two bolded terms, byte array and buffer, are synonymous. The buffer is the byte[] array that holds the bytes written to the stream.
The buffer size is analogous to the capacity of an ArrayList. If you write more than 32 bytes to the stream then it has to grow the buffer, which involves allocating a new array and copying the bytes from old to new. A default "capacity" of 32 is inefficient if you know you'll be writing more than that.
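For illustration, a minimal sketch of the effect of pre-sizing (the 4096-byte figure is just an illustrative guess at the expected data size):
byte[] chunk = new byte[256];
ByteArrayOutputStream small = new ByteArrayOutputStream();     // default: starts at 32 bytes
ByteArrayOutputStream sized = new ByteArrayOutputStream(4096); // pre-sized: one allocation up front
for (int i = 0; i < 16; i++)                                   // 16 * 256 = 4096 bytes total
{
    small.write(chunk, 0, chunk.length); // grows and copies its internal array several times
    sized.write(chunk, 0, chunk.length); // never has to grow
}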

The Javadoc says:
This class implements an output stream in which the data is written into a byte array. The buffer automatically grows as data is written to it.
So in the space of two sentences, it has used two different terms. There are numerous other examples in the same doc.
On the one hand, this might be confusing if you don't know that they are referring to the same thing; it might be clearer if it said something like:
is written to a buffer, implemented as a byte array.
But I think that, once you know (or assume, since this is quite a common convention) that they refer to the same thing, it is no longer especially confusing.

Related

Java: Efficiently converting an array of longs to an array of bytes

I have an array of longs I want to write to disk. The most efficient disk I/O functions take in byte arrays, for example:
FileOutputStream.write(byte[] b, int offset, int length)
...so I want to begin by converting my long[] to byte[] (8 bytes for each long). I'm struggling to find a clean way to do this.
Direct typecasting doesn't seem allowed:
ConversionTest.java:6: inconvertible types
found : long[]
required: byte[]
byte[] byteArray = (byte[]) longArray;
^
It's easy to do the conversion by iterating over the array, for example:
ByteBuffer bytes = ByteBuffer.allocate(longArray.length * (Long.SIZE / 8));
for (long l : longArray)
{
    bytes.putLong(l);
}
byte[] byteArray = bytes.array();
...however that seems far less efficient than simply treating the long[] as a series of bytes.
Interestingly, when reading the file, it's easy to "cast" from byte[] to longs using Buffers:
LongBuffer longs = ByteBuffer.wrap(byteArray).asLongBuffer();
...but I can't seem to find any functionality to go the opposite direction.
I understand there are endian considerations when converting from long to byte, but I believe I've already addressed those: I'm using the Buffer framework shown above, which defaults to big endian, regardless of native byte order.
No, there is not a trivial way to convert from a long[] to a byte[].
Your best option is likely to wrap your FileOutputStream with a BufferedOutputStream and then write out the individual byte values for each long (using bitwise operators).
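For illustration, a minimal sketch of that first option, assuming a longArray variable and an illustrative file name (exception handling omitted; DataOutputStream.writeLong performs the same big-endian decomposition internally):
try (BufferedOutputStream out =
         new BufferedOutputStream(new FileOutputStream("data.bin")))
{
    for (long l : longArray)
    {
        for (int shift = 56; shift >= 0; shift -= 8)
        {
            out.write((int) (l >>> shift)); // write(int) keeps only the low 8 bits
        }
    }
}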
Another option is to create a ByteBuffer and put your long values into the ByteBuffer and then write that to a FileChannel. This handles the endianness conversion for you, but makes the buffering more complicated.
Concerning efficiency, many details will in fact hardly make a difference. The hard disk is by far the slowest part involved here, and in the time it takes to write a single byte to the disk, you could have converted thousands or even millions of longs to bytes. Any performance test here will tell you less about the implementation than about the performance of the hard disk. If in doubt, one should make dedicated benchmarks comparing the different conversion strategies and the different writing methods, respectively.
Assuming that the main goal is a functionality that allows a convenient conversion and does not impose an unnecessary overhead, I'd like to propose the following approach:
One can create a ByteBuffer of sufficient size, view it as a LongBuffer, use the bulk LongBuffer#put(long[]) method (which takes care of endianness conversion, if necessary, and does so as efficiently as possible), and finally write the original ByteBuffer (now filled with the long values) to the file using a FileChannel.
Following this idea, I think that this method is convenient and (most likely) rather efficient:
private static void bulkAndChannel(String fileName, long[] longArray)
{
    ByteBuffer bytes = ByteBuffer.allocate(longArray.length * Long.BYTES);
    bytes.order(ByteOrder.nativeOrder()).asLongBuffer().put(longArray);
    try (FileOutputStream fos = new FileOutputStream(fileName))
    {
        fos.getChannel().write(bytes);
    }
    catch (IOException e)
    {
        e.printStackTrace();
    }
}
(Of course, one could argue about whether allocating one "large" buffer is the best idea. But thanks to the convenience methods of the Buffer classes, this could easily be modified to write "chunks" of data of an appropriate size, for the case where the array is really huge and the memory overhead of one large ByteBuffer would be prohibitive.)
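A hedged sketch of such a chunked variant (not part of the original answer; the chunk size is an arbitrary illustrative choice):
private static void chunkedWrite(String fileName, long[] longArray)
{
    final int CHUNK_LONGS = 64 * 1024; // 64K longs = 512 KB per chunk
    ByteBuffer bytes = ByteBuffer.allocate(CHUNK_LONGS * Long.BYTES);
    bytes.order(ByteOrder.nativeOrder()); // same byte order choice as above
    try (FileOutputStream fos = new FileOutputStream(fileName))
    {
        FileChannel channel = fos.getChannel();
        for (int offset = 0; offset < longArray.length; offset += CHUNK_LONGS)
        {
            int count = Math.min(CHUNK_LONGS, longArray.length - offset);
            bytes.clear();
            bytes.asLongBuffer().put(longArray, offset, count);
            bytes.limit(count * Long.BYTES);
            while (bytes.hasRemaining())
            {
                channel.write(bytes); // a single write() may be partial
            }
        }
    }
    catch (IOException e)
    {
        e.printStackTrace();
    }
}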
OP here.
I have thought of one approach: ByteBuffer.asLongBuffer() returns an instance of ByteBufferAsLongBufferB, a class which wraps ByteBuffer in an interface for treating the data as longs while properly managing endianness. I could extend ByteBufferAsLongBufferB, and add a method to return the raw byte buffer (which is protected).
But this seems so esoteric and convoluted I feel there must be an easier way. Either that, or something in my approach is flawed.

How can I get byteSize of String Array other than traversing the Array

I want to optimize my code by using ByteBuffer in place of String. What I receive is a String[], and I am formatting each element of it.
e.g. String strAry[] = {"Help", "I", "am", "trapped", "in", "a", "fortune", "cookie", "factory"};
is my String array. I am writing its contents to a .csv file in the
format "StrArray[0]";"StrArray[1]";"StrArray[2]";"StrArray[3]"; and so on,
which internally creates many intermediate Strings, and this code sometimes runs in a loop hundreds of thousands of times.
I want to use a ByteBuffer instead. But when creating it with
ByteBuffer bbuf = ByteBuffer.allocate(bufferSize); I need to specify the buffer size.
I don't want to iterate over each element of the String[] just to calculate its byte size.
Any help is appreciated.
A couple of notes:
Data structure usage
I think you should be using CharBuffer rather than ByteBuffer: CharBuffer takes a number of characters, not bytes.
Buffers from Java NIO are meant to be used as buffers, which means you may need to read into them multiple times.
If you need the whole content in memory, buffers are not the right data structure for that use case.
You don't have to know the exact size for a buffer; the allocated size is the maximal capacity of the buffer.
StringBuilder is a mutable data structure for string processing. You might consider using it instead; you don't have to know the exact size in advance (see the sketch below).
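For the CSV case from the question, a minimal StringBuilder sketch (the format is taken from the question; strAry is the question's array):
StringBuilder sb = new StringBuilder();
for (String s : strAry)
{
    sb.append('"').append(s).append("\";"); // produces "Help";"I";"am";...
}
String line = sb.toString();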
Computation of final size
This might be done using the Stream API (Java 8) or similar utility methods.
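For example, a hedged sketch of that size computation, assuming UTF-8 encoding (uses java.util.Arrays and java.nio.charset.StandardCharsets; note it still visits every element, which is unavoidable if the byte size must be exact):
int totalBytes = Arrays.stream(strAry)
                       .mapToInt(s -> s.getBytes(StandardCharsets.UTF_8).length)
                       .sum();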

Reading certain number of bytes into ByteBuffer

I have a binary file of 10 MB. I need to read it in chunks of different sizes (e.g. 300 or 273 bytes). For reading I use FileChannel and ByteBuffer. Right now, on each read iteration I allocate a new ByteBuffer of the size I need to read.
Is it possible to allocate only once (let's say 200 KB) for the ByteBuffer and read into it (300, 273 bytes, etc.)? I will not read more than 200 KB at once. The entire file must be read.
UPD
public void readFile(FileChannel fc, int amountOfBytesToRead) throws IOException
{
    ByteBuffer bb = ByteBuffer.allocate(amountOfBytesToRead);
    fc.read(bb);
    bb.flip();
    // do something with bytes
    bb = null;
}
I cannot read the whole file at once due to memory constraints; that's why I am reading in chunks. Efficiency is also very important (which is why I don't want my current approach of allocating a new buffer per read). Thanks
Declare several ByteBuffers of the sizes you need and use scatter-read: read(ByteBuffer[] dsts, ...).
Or forget about NIO and use DataInputStream.readFully(). If you put a BufferedInputStream underneath you won't suffer any performance loss: it may even be faster.
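For completeness, reusing a single buffer as the question literally asks is also possible; a minimal sketch (names are illustrative):
ByteBuffer bb = ByteBuffer.allocate(200 * 1024); // allocated once, reused for every chunk

void readChunk(FileChannel fc, int amountOfBytesToRead) throws IOException
{
    bb.clear();                    // reset position and limit from the previous chunk
    bb.limit(amountOfBytesToRead); // cap this read at the requested chunk size
    while (bb.hasRemaining() && fc.read(bb) != -1)
    {
        // loop: a single read() may return fewer bytes than requested
    }
    bb.flip();
    // consume the bytes between position 0 and the limit here
}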

What is the fastest way to load a big 2D int array from a file?

I'm loading a 2D array from file, it's 15,000,000 * 3 ints big (it will be 40,000,000 * 3 eventually). Right now, I use dataInputStream.readInt() to sequentially read the ints. It takes ~15 seconds. Can I make it significantly (at least 3x) faster or is this about as fast as I can get?
Yes, you can. From a benchmark of 13 different ways of reading files:
If you have to pick the fastest approach, it would be one of these:
FileChannel with a MappedByteBuffer and array reads.
FileChannel with a direct ByteBuffer and array reads.
FileChannel with a wrapped array ByteBuffer and direct array access.
For the best Java read performance, there are 4 things to remember:
Minimize I/O operations by reading an array at a time, not a byte at a time. An 8 KB array is a good size (which is why it is the default for BufferedInputStream).
Minimize method calls by getting data an array at a time, not a byte at a time. Use array indexing to get at bytes in the array.
Minimize thread synchronization locks if you don't need thread safety. Either make fewer method calls to a thread-safe class, or use a non-thread-safe class like FileChannel and MappedByteBuffer.
Minimize data copying between the JVM/OS, internal buffers, and application arrays. Use FileChannel with memory mapping, or a direct or wrapped array ByteBuffer.
Map your file into memory!
Java 7 code:
FileChannel channel = FileChannel.open(Paths.get("/path/to/file"),
        StandardOpenOption.READ);
MappedByteBuffer buf = channel.map(FileChannel.MapMode.READ_ONLY,
        0, channel.size());
// use buf
See here for more details.
If you use Java 6, you'll have to:
RandomAccessFile file = new RandomAccessFile("/path/to/file", "r");
FileChannel channel = file.getChannel();
// same thing to obtain buf
You can even use .asIntBuffer() on the buffer if you want. And you can read only what you actually need to read, when you need to read it. And it does not impact your heap.
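For instance, a hedged sketch of pulling the question's N x 3 table through the mapped buffer (this assumes the file was written big-endian, e.g. by DataOutputStream, which matches ByteBuffer's default order):
IntBuffer ints = buf.asIntBuffer();
int rows = 15_000_000;            // figure from the question; adjust as needed
int[][] table = new int[rows][3];
for (int i = 0; i < rows; i++)
{
    ints.get(table[i]);           // bulk-get one 3-int row; no per-int I/O call
}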

How to initialize a ByteBuffer if you don't know how many bytes to allocate beforehand?

Is this:
ByteBuffer buf = ByteBuffer.allocate(1000);
...the only way to initialize a ByteBuffer?
What if I have no idea how many bytes I need to allocate..?
Edit: More details:
I'm converting one image file format to a TIFF file. The problem is the starting file format can be any size, but I need to write the data in the TIFF to little endian. So I'm reading the stuff I'm eventually going to print to the TIFF file into the ByteBuffer first so I can put everything in Little Endian, then I'm going to write it to the outfile. I guess since I know how long IFDs are, headers are, and I can probably figure out how many bytes in each image plane, I can just use multiple ByteBuffers during this whole process.
The types of places that you would use a ByteBuffer are generally the types of places that you would otherwise use a byte array (which also has a fixed size). With synchronous I/O you often use byte arrays, with asynchronous I/O, ByteBuffers are used instead.
If you need to read an unknown amount of data using a ByteBuffer, consider using a loop with your buffer and append the data to a ByteArrayOutputStream as you read it. When you are finished, call toByteArray() to get the final byte array.
Any time when you aren't absolutely sure of the size (or maximum size) of a given input, reading in a loop (possibly using a ByteArrayOutputStream, but otherwise just processing the data as a stream, as it is read) is the only way to handle it. Without some sort of loop, any remaining data will of course be lost.
For example:
final byte[] buf = new byte[4096];
int numRead;
// Use try-with-resources to auto-close streams.
try (
    final FileInputStream fis = new FileInputStream(...);
    final ByteArrayOutputStream baos = new ByteArrayOutputStream()
) {
    while ((numRead = fis.read(buf)) > 0) {
        baos.write(buf, 0, numRead);
    }
    final byte[] allBytes = baos.toByteArray();
    // Do something with the data.
}
catch (final Exception e) {
    // Do something on failure...
}
If you instead wanted to write Java ints, or other things that aren't raw bytes, you can wrap your ByteArrayOutputStream in a DataOutputStream:
ByteArrayOutputStream baos = new ByteArrayOutputStream();
DataOutputStream dos = new DataOutputStream(baos);
while (thereAreMoreIntsFromSomewhere()) {
    int someInt = getIntFromSomewhere();
    dos.writeInt(someInt);
}
byte[] allBytes = baos.toByteArray();
Depends.
Library
Converting file formats tends to be a solved problem for most problem domains. For example:
Batik can transcode between various image formats (including TIFF).
Apache POI can convert between office spreadsheet formats.
Flexmark can generate HTML from Markdown.
The list is long. The first question should be, "What library can accomplish this task?" If performance is a consideration, your time is likely better spent optimising an existing package to meet your needs than writing yet another tool. (As a bonus, other people get to benefit from the centralised work.)
Known Quantities
Reading a file? Allocate file.size() bytes.
Copying a string? Allocate string.length() bytes.
Copying a TCP packet? Allocate 1500 bytes, for example.
Unknown Quantities
When the number of bytes is truly unknown, you can do a few things:
Make a guess.
Analyze example data sets that you would buffer, and use the average length.
Example
Java's StringBuffer, unless otherwise instructed, uses an initial buffer size to hold 16 characters. Once the 16 characters are filled, a new, longer array is allocated, and then the original 16 characters copied. If the StringBuffer had an initial size of 1024 characters, then the reallocation would not happen as early or as often.
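For illustration (1024 is just the figure used above):
StringBuffer small = new StringBuffer();     // initial capacity 16; grows by allocate-and-copy
StringBuffer sized = new StringBuffer(1024); // initial capacity 1024; defers reallocation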
Optimization
Either way, this is probably a premature optimization. Typically you would allocate a set number of bytes when you want to reduce the number of internal memory reallocations that get executed.
It is unlikely that this will be the application's bottleneck.
The idea is that it's only a buffer, not the whole of the data: a temporary resting spot for data as you read a chunk and process it (possibly writing it somewhere else). So allocate yourself a big enough "chunk" and it normally won't be a problem.
What problem are you anticipating?
