Java: Handling large files in input streams


I have a Java app that fetches a relatively small .zip file using a URL, saves it in a temp directory, unzips it onto the local machine and makes some changes to one of the files. This all works great.
However, I am accessing the .zip file via a BufferedInputStream in the following way:
URL url = new URL("http://somedomain.com/file.zip");
InputStream is = new BufferedInputStream(url.openStream(), 1024);
My concern is that this app will actually be used to transfer very large zip files and I was wondering if a BufferedInputStream is actually the best way to do this, or whether I would just end up throwing some type of OutOfMemoryException?
So my question is, will a BufferedInputStream be suitable for this job, or should I be going about it in a completely different way?

BufferedInputStream doesn't load the whole file into memory; it only uses an internal buffer, which in your case is 1024 bytes (1 KB). It never gets larger than that. You could actually increase the value if you aren't going to have many streams open at once.
Edit: you may be thinking of a ByteArrayOutputStream, where the data is kept in memory.

It depends on what you do with the content you read. If you read everything into memory, it will fail for large files. If you write it to another stream as you go, then it will be fine. Use BufferedInputStream.
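A minimal sketch of the "write it to another stream" approach, assuming a fixed-size copy buffer. The class name StreamCopy and the in-memory stand-in for url.openStream() are made up for illustration; a real download would wrap the URL stream and write to a FileOutputStream.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

class StreamCopy {
    // Copies everything from in to out through one fixed-size buffer.
    // Only bufferSize bytes are held by this method at any time,
    // regardless of how large the source is.
    static long copy(InputStream in, OutputStream out, int bufferSize) throws IOException {
        byte[] buffer = new byte[bufferSize];
        long total = 0;
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
            total += n;
        }
        out.flush();
        return total;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical stand-in for url.openStream(); the sink here is in memory
        // only to keep the demo self-contained.
        byte[] fake = new byte[100_000];
        try (InputStream in = new BufferedInputStream(new ByteArrayInputStream(fake), 1024);
             ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            System.out.println("copied " + copy(in, out, 8192) + " bytes");
        }
    }
}
```

With a file as the destination, memory use should stay around the buffer sizes even for multi-gigabyte downloads.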

From the official Java Tutorials - Buffered Streams:
The Java platform implements buffered I/O streams. Buffered input
streams read data from a memory area known as a buffer; the native
input API is called only when the buffer is empty. Similarly, buffered
output streams write data to a buffer, and the native output API is
called only when the buffer is full.
There is another great SUN article.
So the answer is: BufferedInputStream is suitable for this kind of job in the sense of performance.
And yes, the memory consumption isn't so much dependent on the type of the input stream....

Related

How does Java Streaming read actually work?

How do Java input streams actually work? For example, when you call inputstream.read(), how does Java break the file down into packets? Does Java care about whether the file is .mp3, .doc, .txt, .mov? How does Java IO actually break all these different file types down into packets which can be streamed?
I greatly appreciate any answers on this topic.

when you call inputstream.read(), how does Java break the file down into packets?
It doesn't. Files don't have packets.
Does java care about whether the file is .mp3, .doc, .txt, .mov ?
No.
How does the java io actually break all these different file types down into packets which can be streamed?
It doesn't. The files are byte-streams, and that is a property of the underlying resource and the operating system, not Java.

When reading single bytes from streams, the read() method blocks until data is available.
Some streams may fetch data in blocks rather than byte-wise, but the block size completely depends upon the implementation (reading from compressed streams, reading from encrypted streams based on block ciphers, ...).
You can ask the stream how many bytes can be read without blocking (InputStream.available()), if somehow you need to know if and how much is buffered.
Java also provides a BufferedInputStream class which wraps any stream and can do buffered reads. The buffer size can be specified (default is 8 kB).
When using file streams, the file type has no effect on the buffering behaviour. It's recommended to always use BufferedInputStream/BufferedOutputStream when reading from and writing to files.
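To illustrate that the stream only sees bytes, here is a small sketch that counts bytes with single-byte read() calls behind a BufferedInputStream. ByteCounter and the ".mp3" temp file are made up for the example; the suffix has no effect on how the bytes are read.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

class ByteCounter {
    // Counts bytes by calling read() one byte at a time.
    // read() returns a value in 0..255 for each byte and -1 at end of stream.
    static long countBytes(InputStream in) throws IOException {
        long count = 0;
        while (in.read() != -1) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        // The ".mp3" suffix is arbitrary; the stream never looks at it.
        Path tmp = Files.createTempFile("demo", ".mp3");
        try {
            Files.write(tmp, new byte[] {0x49, 0x44, 0x33, 0x00, 0x7f});
            // The BufferedInputStream fetches from the file in blocks,
            // so the single-byte reads are served from its buffer.
            try (InputStream in = new BufferedInputStream(Files.newInputStream(tmp))) {
                System.out.println(countBytes(in));
            }
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```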

Java - processing file in memory without the disk R/W

I am receiving files through a socket and saving them to a database. So I'm receiving the byte stream and passing it to a back-end process, say Process1, for the DB save.
I'm looking to do this without saving the stream on disk. So, rather than storing the incoming stream as a file on disk and then passing that file to Process1, I'm looking to pass it while it's still in memory. This is to eliminate the time-costly disk read and write.
One way I can do this is to pass the byte[] to Process1. I'm wondering whether there's a better way of doing this.
TIA.

You can use a ByteArrayOutputStream. It is, essentially, a growable byte[] which you can write into at will, within the limit of your available heap space.
After having written to it/flushed it/closed it (although those last two operations are essentially no-ops, that's no reason for ditching sane practices), you can obtain a copy of its contents as a byte array using this class's .toByteArray().
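A rough sketch of that idea, assuming the socket's InputStream is read to the end. InMemoryReceiver is a hypothetical name, and a ByteArrayInputStream stands in for socket.getInputStream() so the example runs on its own.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

class InMemoryReceiver {
    // Drains the stream into a growable in-memory buffer and returns the bytes.
    // The whole payload ends up on the heap, so this only suits data that fits in memory.
    static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buf.write(chunk, 0, n);
        }
        return buf.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for socket.getInputStream().
        InputStream fromSocket = new ByteArrayInputStream(new byte[10_000]);
        byte[] payload = readAll(fromSocket);
        System.out.println(payload.length);
    }
}
```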

Socket sounds like what you are looking for.

Best way to use as little memory as possible when reading/writing large file?

I'm on mobile (Android), and have a large text file, about 50 MB. I want to be able to open the file and seek to a particular position, then start reading data into a buffer from that point. Is using FileReader + BufferedReader the best way to do this if I want to use as little memory as possible?
BufferedReader in = new BufferedReader(new FileReader("foo.txt"));
in.skip(byteCount); // in some cases I have to read from an offset
// start reading a line at a time here
I'll also need to write to the file, only ever appending data, so:
FileWriter w = new FileWriter("foo.txt", true);
w.write(someCharacters);
I'm primarily interested to know whether, by using the wrong file reader/writer classes, I may accidentally be loading the entire file contents into memory before the reads or writes.
Thanks

Basically you don't want to read the whole file, but just a certain portion of it. In this case use java.io.RandomAccessFile instead:
Its seek() method is guaranteed to do a real seek instead of reading and discarding (which is what some implementations of InputStream.skip() actually do).
The seek() method can move the file pointer backwards, something you can't do with an InputStream.
A getFilePointer() method is provided to get the current position in the file.
It only reads what you tell it to read, so there's no fear you'll accidentally load more than you want.
My dictionary app used RandomAccessFile to access about 45 MB of data back when each Android app could only use 16 MB of RAM. A service running my dictionary engine on the same 45 MB of data uses only about 2 MB of RAM (and most of that was probably used by the Dalvik VM rather than my search engine). So this class definitely works as intended.
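A minimal sketch of the seek-then-read and append-only usage described above. SeekAndRead and its method names are made up for the example, and it assumes ASCII text, because RandomAccessFile.readLine() turns each byte into one char.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class SeekAndRead {
    // Jumps to byteOffset and reads one line; nothing before the offset is read.
    static String readLineAt(Path file, long byteOffset) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            raf.seek(byteOffset);
            return raf.readLine(); // one byte per char, so ASCII only
        }
    }

    // Appends data at the end of the file without reading the existing contents.
    static void append(Path file, byte[] data) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "rw")) {
            raf.seek(raf.length());
            raf.write(data);
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("foo", ".txt");
        try {
            Files.write(tmp, "alpha\nbeta\ngamma\n".getBytes(StandardCharsets.US_ASCII));
            System.out.println(readLineAt(tmp, 6)); // offset 6 is the start of the second line
            append(tmp, "delta\n".getBytes(StandardCharsets.US_ASCII));
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

Note that readLine() here is unbuffered, so for reading many lines it may be worth reading larger blocks with read(byte[]) and splitting them yourself.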

You could try using a memory mapped file (java.nio.channels.FileChannel.map()). I'm not sure how much heap space would be allocated for this though.
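For reference, a small sketch of reading a region through FileChannel.map(). MappedRead is a hypothetical helper. The mapped region itself lives outside the Java heap; only the byte[] the data is copied into is heap memory.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class MappedRead {
    // Maps [offset, offset + length) read-only and copies that region into a byte[].
    static byte[] readRegion(Path file, long offset, int length) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            MappedByteBuffer map = ch.map(FileChannel.MapMode.READ_ONLY, offset, length);
            byte[] out = new byte[length];
            map.get(out); // copies from the mapping into the heap array
            return out;
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("mapped", ".bin");
        tmp.toFile().deleteOnExit(); // a mapped file may not be deletable right away on some platforms
        Files.write(tmp, new byte[] {0, 1, 2, 3, 4, 5, 6, 7, 8, 9});
        byte[] region = readRegion(tmp, 3, 4);
        System.out.println(region.length);
    }
}
```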

Avoid obtaining same InputStream more than once

I can see there are a number of posts about reusing an InputStream. I understand an InputStream is a one-time thing and cannot be reused.
However, I have a use case like this:
I have downloaded the file from DropBox by obtaining the DropBoxInputStream using DropBox's Java SDK. I then need to upload the file to another system by passing the InputStream. However, as part of the upload, I have to provide the MD5 of the file. So I have to read the file from the stream before uploading the file. Because the DropBoxInputStream I received can only be used once, I have to get another DropBoxInputStream after I have calculated the MD5 and before uploading the file. The procedure is like:
Get first DropBoxInputStream
Read from the DropBoxInputStream and calculate MD5
Get the second DropBoxInputStream
Upload file using the MD5 and the second DropBoxInputStream.
I am wondering whether there is any way for me to "cache" or "back up" the InputStream before I calculate the MD5, so that I can skip step 3 of obtaining the same DropBoxInputStream again?
Many thanks
EDIT:
Sorry I missed some information.
What I am currently doing is using an MD5DigestOutputStream to calculate the MD5. I stream data through the MD5DigestOutputStream and save it locally as a temp file. Once the data has gone through the MD5DigestOutputStream, it will have calculated the MD5.
I then call a third-party library to upload the file using the calculated MD5 and a FileInputStream which reads from the temp file.
However, this sometimes requires huge disk space and I want to remove the need for a temp file. The library I use only accepts an MD5 and an InputStream. This means I have to calculate the MD5 on my end. My plan is to use my MD5DigestOutputStream to write data to /dev/null (not keeping the file) so that I can calculate the MD5, and then get the InputStream from DropBox again and pass that to the library I use. I assume the library will be able to get the file directly from DropBox without the need for me to cache the file either in memory or on disk. Will it work?

Input streams aren't really designed for creating copies or re-using; they're specifically for situations where you don't want to read everything into a byte array and use array operations on that (this is especially useful when the whole array isn't available, as in, e.g., socket communication). You could buffer up into a byte array, which is the process of reading sections from the stream into a byte array buffer until you have enough information.
But that's unnecessary for calculating an MD5. Notice that InputStream is abstract, so it needs to be implemented in a subclass. It has many implementations: GZIPInputStream, FileInputStream, etc. Several of these are, in design pattern speak, decorators of the IO stream: they wrap another stream and add extra functionality to the abstract base IO classes. For example, GZIPInputStream decompresses gzip data as it is read from the wrapped stream.
So, what you need is a stream to do this for MD5. There is, joyfully, a well documented similar thing: see this answer. So you should just be able to pass your DropBox input stream (as it will itself be an InputStream) to create a new DigestInputStream, and then you can both take the MD5 and continue to read as before.
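A rough sketch of the DigestInputStream idea, with a ByteArrayInputStream standing in for the DropBox stream. Md5WhileReading and md5Hex are hypothetical names. One caveat worth hedging: the digest is only complete once the stream has been read to the end, so this helps when the MD5 can be supplied after the data has been consumed, not when it is needed up front.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class Md5WhileReading {
    // Wraps the source in a DigestInputStream, reads it to the end,
    // and returns the MD5 as a lowercase hex string.
    static String md5Hex(InputStream source) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        try (DigestInputStream in = new DigestInputStream(source, md)) {
            byte[] buffer = new byte[8192];
            while (in.read(buffer) != -1) {
                // A real consumer (e.g. an uploader) would use the bytes here;
                // the digest is updated as a side effect of each read.
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for the stream obtained from the DropBox SDK.
        InputStream in = new ByteArrayInputStream("abc".getBytes(StandardCharsets.US_ASCII));
        System.out.println(md5Hex(in)); // prints 900150983cd24fb0d6963f7d28e17f72
    }
}
```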
Worried about type casting? The idea with decorators in Java is that, since the InputStream base class interfaces all the methods and 'beef' you need to do your IO, there's no harm in passing instances of objects inheriting from InputStream in the constructor of each stream implementation, and you can still do the same core IO.
Finally, I should probably answer your actual question: say you still want to "cache" or "back up" the stream anyway? Well, you could just write it to a byte array. This is well documented, but can become a faff when your streams get more complicated. Alternatively, try looking at a PushbackInputStream. Here, you can easily write a function to read off n bytes, perform an operation on them, and then restore them to the stream. It's generally good to avoid these implementations of streams in Java, as they're bad for memory use, but no worse than buffering everything up, which you'd otherwise have to do.
Or, of course, I would have a go with DigestInputStream.
Hope this helps,
Best.

You don't need to open a new InputStream from DropBox.
Once you have read the file from DropBox, you have it locally. So it is either in memory (in a byte array) or you stored it in a local file. Now you can create an InputStream that reads the data from memory (ByteArrayInputStream) or disk (FileInputStream) in order to upload the file.
So instead of caching the InputStream (which you can't) you cache the contents (which you can).

Java loading binary files

Please show me the best/fast methods for:
1) Loading very small binary files into memory. For example icons;
2) Loading/reading very big binary files of size 512 MB+. Maybe I must use memory-mapped IO?
3) Your common choice when you do not want to think about size/speed but must do only one thing: read all bytes into memory?
Thank you!!!
P.S. Sorry for maybe trivial question. Please do not close it;)
P.S.2. Mirror of analog question for C#;

For memory-mapped files, Java has the nio package: Memory Mapped Files
Check out the byte stream classes for small files: Byte Stream
Check out buffered I/O for larger files: Buffered Stream

The simplest way to read a small file into memory is:
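// Requires: import java.io.*;
// Make a file object from the path name
File file = new File("mypath");
// Find the size; length() returns a long, so this cast is only safe for files under 2 GB
int size = (int) file.length();
// Create a buffer big enough to hold the file
byte[] contents = new byte[size];
// Create an input stream from the file object
DataInputStream in = new DataInputStream(new FileInputStream(file));
// Read it all; readFully keeps reading until the buffer is full
in.readFully(contents);
// Close the file
in.close();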
In real life you'd need some try/catch blocks in case of I/O errors.
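As a possible shortcut for small files on Java 7 or later, java.nio.file.Files.readAllBytes does the sizing, reading and closing in one call. A minimal sketch, with SmallFileLoader as a made-up wrapper name and a temp file in place of a real icon:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

class SmallFileLoader {
    // Reads the whole file into memory; intended for small files such as icons.
    static byte[] load(Path path) throws IOException {
        return Files.readAllBytes(path);
    }

    public static void main(String[] args) {
        try {
            Path tmp = Files.createTempFile("icon", ".bin");
            try {
                Files.write(tmp, new byte[] {1, 2, 3, 4});
                System.out.println(load(tmp).length);
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            // I/O errors (missing file, permissions, ...) end up here.
            System.err.println("Could not read file: " + e.getMessage());
        }
    }
}
```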
If you're reading a big file, I would strongly suggest NOT reading it all into memory at one time if it can possibly be avoided. Read it and process it in chunks. It's a very rare application that really needs to hold a 500MB file in memory all at once.
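Java does support memory-mapped I/O through the java.nio package: FileChannel.map() returns a MappedByteBuffer backed by the file. A single mapping cannot exceed Integer.MAX_VALUE bytes (about 2 GB), so a larger file has to be mapped in several regions.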
