Reading a file larger than 2GB into memory in Java

Since ByteArrayInputStream is limited to 2GB, is there any alternate solution that allows me to store the whole contents of a 2.3GB (and possibly larger) file into an InputStream to be read by Stax2?
Current code:
XMLInputFactory xmlInputFactory = XMLInputFactory.newInstance();
XMLStreamReader xmlStreamReader = xmlInputFactory.createXMLStreamReader(in); //ByteArrayInputStream????
try
{
    SchemaFactory factory = SchemaFactory.newInstance("http://www.w3.org/2001/XMLSchema");
    Schema schema = factory.newSchema(new StreamSource(schemaInputStream));
    Validator validator = schema.newValidator();
    validator.validate(new StAXSource(xmlStreamReader));
}
finally
{
    xmlStreamReader.close();
}
For performance reasons, the variable in must not read from disk. I have plenty of RAM.

The whole point of StAX2 is that you do not need to read the file into memory. You can just supply the source, and let the StAX StreamReader pull the data as it needs to.
What additional constraints do you have that you are not showing in your question?
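For example, you can feed the reader straight from the file, so nothing close to 2.3GB ever needs to sit in memory. A minimal sketch (the filename is a placeholder):
XMLInputFactory xmlInputFactory = XMLInputFactory.newInstance();
XMLStreamReader xmlStreamReader = xmlInputFactory.createXMLStreamReader(
        new StreamSource(new File("huge.xml"))); // data is pulled on demand, never fully buffered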
If you have lots of memory, and you want to get good performance, just wrap your InputStream with a large byte buffer, and let the buffer do the buffering for you:
// 4 meg buffer on the stream
InputStream buffered = new BufferedInputStream(schemaInputStream, 1024 * 1024 * 4);
An alternative to solving this in Java is to create a RAM disk and store the file on it. That moves the problem out of Java, where the basic limitation is that a single array can hold just under Integer.MAX_VALUE elements.

Use NIO to read the file into a gigantic ByteBuffer, and then create a stream class that reads from the ByteBuffer. There are several such classes floating around in open-source projects.
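For illustration, a minimal sketch of such a stream class (keeping in mind that a single ByteBuffer is itself capped at Integer.MAX_VALUE bytes, so a 2.3GB file still needs more than one buffer):
import java.io.InputStream;
import java.nio.ByteBuffer;

class ByteBufferInputStream extends InputStream {
    private final ByteBuffer buf;

    ByteBufferInputStream(ByteBuffer buf) { this.buf = buf; }

    @Override
    public int read() {
        // Single-byte read: mask to 0..255, or -1 at end of buffer.
        return buf.hasRemaining() ? (buf.get() & 0xFF) : -1;
    }

    @Override
    public int read(byte[] dst, int off, int len) {
        if (!buf.hasRemaining()) return -1;
        int n = Math.min(len, buf.remaining());
        buf.get(dst, off, n); // bulk copy straight out of the buffer
        return n;
    }
}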

If you have huge quantities of memory, you really won't get any performance improvement anyway. The file only gets read in once either way, and the disk cache will ensure it is done optimally. Just use a disk-based input stream.

You can keep the data in memory in compressed form, writing it through a GZIPOutputStream into a ByteArrayOutputStream:
ByteArrayOutputStream baos = new ByteArrayOutputStream();
OutputStream out = new GZIPOutputStream(baos);
// ... write the XML data to out, then close it ...
out.close();
byte[] bytes = baos.toByteArray(); // < 100 MB?
And then later wrap a ByteArrayInputStream over those bytes in a GZIPInputStream:
InputStream in = new GZIPInputStream(new ByteArrayInputStream(bytes));
Still a minor slowdown, but that should be ideal for XML, which compresses well.

Related

Performance: BufferedOutputStream vs FileOutputStream in Java

I have read that the BufferedOutputStream class improves efficiency and should be used with FileOutputStream in this way -
BufferedOutputStream bout = new BufferedOutputStream(new FileOutputStream("myfile.txt"));
and for writing to the same file, the statement below also works -
FileOutputStream fout = new FileOutputStream("myfile.txt");
But the recommended way is to use a buffer for reading/writing operations, and that is the reason I prefer to use a buffer too.
But my question is: how do I measure the performance of the above 2 statements? Is there any tool, or something of that kind (I don't know exactly what), that would be useful for analysing their performance?
As someone new to the Java language, I am very curious to know about it.
Buffering is only helpful if you are doing inefficient reading or writing. For reading, it's helpful for letting you read line by line, even when you could gobble up bytes/chars faster just using read(byte[]) or read(char[]). For writing, it allows you to collect the pieces of what you want to send through I/O in the buffer, and to send them only on flush (see, for example, PrintWriter's autoFlush constructor argument).
But if you are just trying to read or write as fast as you can, buffering doesn't improve performance.
For an example of efficient reading from a file:
File f = ...;
FileInputStream in = new FileInputStream(f);
byte[] bytes = new byte[(int) f.length()]; // the file length must fit in an int, i.e. under 2GB
new DataInputStream(in).readFully(bytes); // unlike a bare read(bytes), readFully guarantees the whole array is filled
Versus inefficient reading:
File f = ...;
BufferedReader in = new BufferedReader(new FileReader(f));
String line = null;
while ((line = in.readLine()) != null) {
    // If every readLine call read directly from the file system / hard drive,
    // it would slow things down tremendously. The buffer captures the file
    // contents, so reads effectively come from memory, which is more efficient.
}
These numbers came from a MacBook Pro laptop using an SSD.
BufferedFileStreamArrayBatchRead (809716.60-911577.03 bytes/ms)
BufferedFileStreamPerByte (136072.94 bytes/ms)
FileInputStreamArrayBatchRead (121817.52-1022494.89 bytes/ms)
FileInputStreamByteBufferRead (118287.20-1094091.90 bytes/ms)
FileInputStreamDirectByteBufferRead (130701.87-956937.80 bytes/ms)
FileInputStreamReadPerByte (1155.47 bytes/ms)
RandomAccessFileArrayBatchRead (120670.93-786782.06 bytes/ms)
RandomAccessFileReadPerByte (1171.73 bytes/ms)
Where there is a range in the numbers, it varies based on the size of the buffer being used. A larger buffer results in more speed up to a point, typically somewhere around the size of the caches within the hardware and operating system.
As you can see, reading bytes individually is always slow. Batching the reads into chunks is easily the way to go. It can be the difference between 1k per ms and 136k per ms (or more).
These numbers are a little old and will vary wildly by setup, but they give you an idea. The code for generating the numbers can be found here; edit Main.java to select the tests that you want to run.
An excellent (and more rigorous) framework for writing benchmarks is JMH. A tutorial for learning how to use JMH can be found here.
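To make that concrete, here is a minimal JMH sketch of the comparison the question asks about. It assumes the jmh-core and jmh-generator-annprocess dependencies are on the classpath, and "myfile.txt" is a placeholder path:
import java.io.*;
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@State(Scope.Thread)
public class WriteBenchmark {
    byte[] payload = new byte[8192];

    @Benchmark
    public void unbuffered() throws IOException {
        try (FileOutputStream out = new FileOutputStream("myfile.txt")) {
            for (byte b : payload) out.write(b); // every write goes straight to the OS
        }
    }

    @Benchmark
    public void buffered() throws IOException {
        try (OutputStream out = new BufferedOutputStream(new FileOutputStream("myfile.txt"))) {
            for (byte b : payload) out.write(b); // writes accumulate in the buffer first
        }
    }
}
Byte-at-a-time writes are exactly the workload where buffering pays off, so the two benchmarks should show a large gap.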

InputStream and OutOfMemory Error

public String loadJSONFromAsset(String path) {
    String json = null;
    try {
        InputStream is = this.getAssets().open(path);
        int size = is.available();
        Log.d("Size: ", "" + size);
        byte[] buffer = new byte[size];
        is.read(buffer);
        is.close();
        json = new String(buffer, "UTF-8");
    } catch (IOException ex) {
        ex.printStackTrace();
    }
    return json;
}
This code converts an asset file into JSON data. It works, literally: it creates the JSON, but the size of "is" is approximately 8MB:
D/Size:: 7827533
and an OutOfMemoryError occurs on most devices, such as:
java.lang.OutOfMemoryError
at java.lang.String.<init>(String.java:255)
at java.lang.String.<init>(String.java:228)
at com.example.fkn.projecttr.List.loadJSONFromAsset(List.java:255)
How can I handle this? How can it be coded more efficiently? It runs fine, but it consumes too much memory on the device, so when the device has no more memory available, the program crashes.
I noticed this:
int size = is.available();
and thought it was a little strange. So I went and looked at the JavaDoc for InputStream.available and this is what it had to say:
Note that while some implementations of InputStream will return the total number of bytes in the stream, many will not. It is never correct to use the return value of this method to allocate a buffer intended to hold all data in this stream.
So you have one of two conditions:
Your file size is actually 8MB.
If you really have this much JSON, you need to rethink what is in there and what you are using it for. One option that I don't see a lot of developers use is JsonReader, which allows you to parse through the JSON without loading the entire stream into memory first (see the sketch after this list).
Your file size is much smaller than 8MB
Just read the file differently, see How do I create a Java string from the contents of a file?
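As a rough sketch of the JsonReader option (the field name "name" and the top-level array structure are assumptions about what the JSON might look like):
InputStream is = getAssets().open(path);
JsonReader reader = new JsonReader(new InputStreamReader(is, "UTF-8"));
reader.beginArray();                         // assumes the top-level value is an array
while (reader.hasNext()) {
    reader.beginObject();
    while (reader.hasNext()) {
        if (reader.nextName().equals("name")) {
            String value = reader.nextString(); // handle one value at a time
        } else {
            reader.skipValue();              // nothing is ever held in memory all at once
        }
    }
    reader.endObject();
}
reader.endArray();
reader.close();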
Keep the bytes compressed: use a GZIPOutputStream on a ByteArrayOutputStream (or ship the asset gzip-compressed).
Then always process that, for output, parsing or whatever, through a GZIPInputStream.
That would save at least a factor of 20 (roughly 10x compression, and about 3 times less String overhead).
For a less invasive change (at the cost of more memory consumption): right now there is an 8 MB byte array plus a 16 MB String. The String could be parsed immediately into a DOM-style JSON tree, discarding the whitespace and mapping equal String values to a single String instance (for instance via a Map<String, String> idmap). Whether that helps depends on how much repetition there is in the data.

How to clone input stream but still re-use original

I am trying to copy the InputStream from a URLConnection, which returns a stream of type HttpInputStream (an inner class of HttpURLConnection).
In other cases, I can copy the original stream to a ByteArrayOutputStream and then use mark/reset on the original, but HttpInputStream does not support mark/reset.
Is there a way I can still copy the stream and reset the original or keep it from being consumed? The original stream inside URLConnection has to be readable because it is passed into another library. I only need to copy the stream so I can read the first two lines of data. Here is what I have for streams that support mark/reset:
InputStream input = null;
ByteArrayOutputStream baos = new ByteArrayOutputStream();
try {
    input = connection.getInputStream();
    byte[] buffer = new byte[200];
    input.mark(200);
    int len = input.read(buffer);
    input.reset();
    baos.write(buffer, 0, len);
    baos.flush();
    String content = baos.toString("UTF-8");
    // I set flags based on the value of content, but I'm omitting that for the sake of simplicity.
} catch (IOException ex) {
    // I do stuff here, but I'm omitting that for the sake of simplicity.
}
InputStreams are not generally cloneable, and not all streams support mark/reset either. There are some possible workarounds within the standard JRE.
Wrap the InputStream in a BufferedInputStream. That one supports mark/reset within the limits of its buffer size, which lets you read a limited amount of data from the beginning, then reset the stream (sketched below).
Another alternative is PushbackInputStream, which allows you to "unread" data you have already read. You need to buffer the data to be pushed back yourself, though, so it may be a bit inconvenient to handle.
If the whole stream isn't terribly big, you could also read the entire stream first, then construct as many ByteArrayInputStreams as needed from the pre-read data. That is only feasible if the data fits in the heap (i.e. less than roughly 2GB).
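A minimal sketch of the BufferedInputStream option, reusing the code from the question (the 200-byte peek is safe as long as no more than 200 bytes are read before the reset):
InputStream input = new BufferedInputStream(connection.getInputStream());
input.mark(200);                       // remember the start of the stream
byte[] buffer = new byte[200];
int len = input.read(buffer);
input.reset();                         // rewind to the marked position
String content = new String(buffer, 0, len, "UTF-8");
// "input" can now be handed to the other library; it will see the stream from byte 0.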
The Apache Commons IO library has a really nice TeeInputStream.
https://commons.apache.org/proper/commons-io/javadocs/api-1.4/org/apache/commons/io/input/TeeInputStream.html
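A hedged sketch of using it here (assuming commons-io is on the classpath): everything the other library reads from the tee is copied into a side buffer, which you can then inspect without consuming the original stream. Note the side buffer only contains whatever has actually been read so far.
ByteArrayOutputStream copy = new ByteArrayOutputStream();
InputStream tee = new TeeInputStream(connection.getInputStream(), copy);
// pass "tee" to the other library in place of the original stream ...
// ... afterwards, the bytes it read are available in the side buffer:
String firstLines = copy.toString("UTF-8");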

Memory required by JVM for creating CSV files and zip it on the fly

I am creating two CSV files using String buffers and byte arrays.
I use ZipOutputStream to generate the zip files. Each CSV file will have 20K records with 14 columns. The records are fetched from the DB and stored in an ArrayList. I have to iterate the list, build a StringBuffer, and convert the StringBuffer to a byte array to write it to the zip entry.
I want to know the memory required by JVM to do the entire process starting from storing the records in the ArrayList.
I have provided the code snippet below.
StringBuffer responseBuffer = new StringBuffer();
String response = new String();
response = "Hello, sdksad, sfksdfjk, World, Date, ask, askdl, sdkldfkl, skldkl, sdfklklgf, sdlksldklk, dfkjsk, dsfjksj, dsjfkj, sdfjkdsfj\n";
for (int i = 0; i < 20000; i++) {
    responseBuffer.append(response);
}
response = responseBuffer.toString();
byte[] responseArray = response.getBytes();
res.setContentType("application/zip");
ZipOutputStream zout = new ZipOutputStream(res.getOutputStream());
ZipEntry parentEntry = new ZipEntry("parent.csv");
zout.putNextEntry(parentEntry);
zout.write(responseArray);
zout.closeEntry();
ZipEntry childEntry = new ZipEntry("child.csv");
zout.putNextEntry(childEntry);
zout.write(responseArray);
zout.closeEntry();
zout.close();
Please help me with this. Thanks in advance.
I'm guessing you've already tried counting how many bytes will be allocated to the StringBuffer and the byte array. But the problem is you can't really know how much memory your app will use unless you have upper bounds on the sizes of the CSV records. If you want your software to be stable, robust and scalable, I'm afraid you're asking the wrong question: you should strive to perform the task using a fixed amount of memory, which in your case seems easily possible.
The key is, that in your case the processing is entirely FIFO - you read records from the database, and then write them (in the same order) into a FIFO stream (OutputStream in that case). Even zip compression is stream-based, and uses a fixed amount of memory internally, so you're totally safe there.
Instead of buffering the entire input in a huge String, then converting it to a huge byte array, then writing it to the output stream - you should read each response element separately from the database (or chunks of fixed size, say 100 records at a time), and write it to the output stream. Something like
res.setContentType("application/zip");
ZipOutputStream zout = new ZipOutputStream(res.getOutputStream());
ZipEntry parentEntry = new ZipEntry("parent.csv");
zout.putNextEntry(parentEntry);
while (... fetch entries ...) {
    zout.write(... data ...);
}
zout.closeEntry();
The advantage of this approach is that because it works with small chunks you can easily estimate their sizes, and allocate enough memory for your JVM so it never crashes. And you know it will still work if your CSV files become much more than 20K lines in the future.
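To make this concrete, here is a sketch of that loop (fetchNextBatch is a hypothetical stand-in for the asker's database access code):
res.setContentType("application/zip");
ZipOutputStream zout = new ZipOutputStream(res.getOutputStream());
zout.putNextEntry(new ZipEntry("parent.csv"));
List<String> batch;
while (!(batch = fetchNextBatch(100)).isEmpty()) { // e.g. 100 records at a time
    for (String record : batch) {
        zout.write(record.getBytes("UTF-8"));      // each chunk goes straight to the stream
    }
}
zout.closeEntry();
zout.close();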
To analyze the memory usage you can use a profiler.
JProfiler and YourKit are very good at this.
VisualVM is also good to an extent.
You can measure the memory with the MemoryTestbench described here:
http://www.javaspecialists.eu/archive/Issue029.html
The article describes what to do. It's simple, accurate to 1 byte, and I use it often. It can even be run from a JUnit test case, which makes it very useful, whereas a profiler cannot be run from a JUnit test case.
With that approach, you can even measure the memory size of a single Integer object.
But with zip there is one special thing: ZipStream uses a native C library, so the MemoryTestbench may not measure that memory, only the Java part.
You should try both variants: the MemoryTestbench, and a profiler (e.g. JProfiler).

How to initialize a ByteBuffer if you don't know how many bytes to allocate beforehand?

Is this:
ByteBuffer buf = ByteBuffer.allocate(1000);
...the only way to initialize a ByteBuffer?
What if I have no idea how many bytes I need to allocate..?
Edit: More details:
I'm converting one image file format to a TIFF file. The problem is that the starting file can be any size, but I need to write the TIFF data in little-endian order. So I'm reading everything I'm eventually going to write to the TIFF file into the ByteBuffer first, so I can put it all in little endian, and then I'll write it to the outfile. I guess, since I know how long the IFDs and headers are, and I can probably figure out how many bytes are in each image plane, I can just use multiple ByteBuffers during the whole process.
The types of places that you would use a ByteBuffer are generally the types of places that you would otherwise use a byte array (which also has a fixed size). With synchronous I/O you often use byte arrays, with asynchronous I/O, ByteBuffers are used instead.
If you need to read an unknown amount of data using a ByteBuffer, consider using a loop with your buffer and append the data to a ByteArrayOutputStream as you read it. When you are finished, call toByteArray() to get the final byte array.
Any time when you aren't absolutely sure of the size (or maximum size) of a given input, reading in a loop (possibly using a ByteArrayOutputStream, but otherwise just processing the data as a stream, as it is read) is the only way to handle it. Without some sort of loop, any remaining data will of course be lost.
For example:
final byte[] buf = new byte[4096];
int numRead;
// Use try-with-resources to auto-close streams.
try (
    final FileInputStream fis = new FileInputStream(...);
    final ByteArrayOutputStream baos = new ByteArrayOutputStream()
) {
    while ((numRead = fis.read(buf)) > 0) {
        baos.write(buf, 0, numRead);
    }
    final byte[] allBytes = baos.toByteArray();
    // Do something with the data.
}
catch (final Exception e) {
    // Do something on failure...
}
If you instead wanted to write Java ints, or other things that aren't raw bytes, you can wrap your ByteArrayOutputStream in a DataOutputStream:
ByteArrayOutputStream baos = new ByteArrayOutputStream();
DataOutputStream dos = new DataOutputStream(baos);
while (thereAreMoreIntsFromSomewhere()) {
    int someInt = getIntFromSomewhere();
    dos.writeInt(someInt);
}
byte[] allBytes = baos.toByteArray();
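One caveat for this particular question: DataOutputStream always writes big-endian. Since the TIFF data needs to be little-endian, ByteBuffer's explicit byte-order control (java.nio.ByteOrder) may be the better fit. A minimal sketch:
ByteBuffer buf = ByteBuffer.allocate(8);
buf.order(ByteOrder.LITTLE_ENDIAN);   // DataOutputStream cannot do this
buf.putInt(42);                       // written as 2A 00 00 00, not 00 00 00 2A
buf.putShort((short) 7);
byte[] out = new byte[buf.position()];
buf.flip();
buf.get(out);                         // raw little-endian bytes for the TIFF writer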
Depends.
Library
Converting file formats tends to be a solved problem for most problem domains. For example:
Batik can transcode between various image formats (including TIFF).
Apache POI can convert between office spreadsheet formats.
Flexmark can generate HTML from Markdown.
The list is long. The first question should be, "What library can accomplish this task?" If performance is a consideration, your time is likely better spent optimising an existing package to meet your needs than writing yet another tool. (As a bonus, other people get to benefit from the centralised work.)
Known Quantities
Reading a file? Allocate file.size() bytes.
Copying a string? Allocate string.length() bytes.
Copying a TCP packet? Allocate 1500 bytes, for example.
Unknown Quantities
When the number of bytes is truly unknown, you can do a few things:
Make a guess.
Analyze example data sets to be buffered; use the average length.
Example
Java's StringBuffer, unless otherwise instructed, uses an initial buffer size to hold 16 characters. Once the 16 characters are filled, a new, longer array is allocated, and then the original 16 characters copied. If the StringBuffer had an initial size of 1024 characters, then the reallocation would not happen as early or as often.
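For instance (StringBuilder is the modern, unsynchronized equivalent):
StringBuilder small = new StringBuilder();     // initial capacity: 16 chars, grows by copying
StringBuilder sized = new StringBuilder(1024); // no reallocation until 1024 chars are appended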
Optimization
Either way, this is probably a premature optimization. Typically you would allocate a set number of bytes when you want to reduce the number of internal memory reallocations that get executed.
It is unlikely that this will be the application's bottleneck.
The idea is that it's only a buffer, not the whole of the data: a temporary resting spot for data as you read a chunk and process it (possibly writing it somewhere else). So allocate yourself a big-enough "chunk" and it normally won't be a problem.
What problem are you anticipating?
