Memory usage of OutputStream for file downloading

So I'm not really sure what actually happens deep-down.
Let's say we are in a web application and a user requests to download a dynamically generated file which can be several MB in size, possibly even 100 MB or more. The application does this:
String disposition = "attachment; filename=\"myFile.txt\"";
response.setHeader("Content-Disposition", disposition);
ServletOutputStream output = response.getOutputStream();
OutputStreamWriter writer = new OutputStreamWriter(output);
service.exportFile(ids, writer, properties);
Am I right that the whole file is never completely in memory? That is, whatever data was generated is sent to the user and then discarded on the server (assuming everything went well, no packet loss)?
I ask this because I need to change the library that generates the files (3rd party) and the new one doesn't use standard Java IO, probably because it is just an API and the actual library is written in C. Anyway, to get a buffer's data the documentation says to call
String data = buffer.toString();
(the files are ASCII)
So is my assumption correct that memory consumption will be affected especially when multiple users download large files at the same time?

Yes, in your first code snippet data is streamed directly to the client, assuming that the implementation of service.exportFile(ids, writer, properties) itself never holds the generated data in memory but really streams it directly to the writer.
With String data = buffer.toString(); you'll definitely place the entire data in heap space, at the latest when buffer.toString() is called, maybe earlier depending on the exact implementation.
In conclusion, in my opinion you have to be aware of two things:
- never assign the data to a variable in your code; write it directly to the output stream
- ensure that the implementation also never holds the whole data in memory while generating it
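
To make the difference concrete, here is a minimal sketch of the two patterns. The class and method names are hypothetical stand-ins (they are not the service or the 3rd party library from the question), and a StringWriter stands in for the servlet response writer.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;
import java.util.Arrays;
import java.util.List;

class StreamingExport {
    // Streaming: each record is written as soon as it is generated, so only
    // the current record (plus any writer buffer) lives on the heap.
    static void exportStreaming(List<Integer> ids, Writer writer) throws IOException {
        for (int id : ids) {
            writer.write("record-" + id + "\n");
        }
        writer.flush();
    }

    // Buffering: the whole export is assembled first, then written at once.
    static void exportBuffered(List<Integer> ids, Writer writer) throws IOException {
        StringBuilder buffer = new StringBuilder();
        for (int id : ids) {
            buffer.append("record-").append(id).append('\n');
        }
        String data = buffer.toString(); // the entire file is on the heap here
        writer.write(data);
        writer.flush();
    }

    public static void main(String[] args) throws IOException {
        StringWriter out = new StringWriter(); // stand-in for the response writer
        exportStreaming(Arrays.asList(1, 2, 3), out);
        System.out.print(out);
    }
}
```

Both methods produce the same output; what differs is the peak heap usage, which for the buffered variant grows with the size of the file and with the number of concurrent downloads.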

The 3rd party lib offers a second solution, which is writing to a file. However, that requires the extra hassle of creating unique file names, then reading the files in and deleting them afterwards. But I implemented it anyway.
In
service.exportFile(ids, writer, properties)
ids is a collection of database identifiers of the records to be exported, and hence ids.size() gives a rough estimate of the resulting file's size. So as long as it is small I use the buffer version, and if it is large the file one.

Related

Best way to use as little memory as possible when reading/writing large file?

I'm on mobile (Android), and have a large text file, about 50 MB. I want to be able to open the file and seek to a particular position, then start reading data into a buffer from that point. Is using FileReader + BufferedReader the best way to do this if I want to use as little memory as possible?
BufferedReader in
= new BufferedReader(new FileReader("foo.txt"));
in.skip(byteCount); // in some cases I have to read from an offset
// start reading a line at a time here
I'll also need to write to the file, only ever appending data, so:
FileWriter w = new FileWriter("foo.txt", true);
w.write(someCharacters);
I'm primarily interested to know whether, by using the wrong file reader/writer classes, I may accidentally be loading the entire file contents into memory before the reads or writes.
Thanks

Basically you don't want to read the whole file, just a certain portion of it. In this case use java.io.RandomAccessFile instead:
- its seek() method is guaranteed to seek rather than read and discard (which is what some implementations of InputStream.skip() actually do)
- seek() can also move the file pointer backwards, something you can't do with an InputStream
- a getFilePointer() method is provided to get the current position in the file
- it only reads what you tell it to read, so there's no fear you'll accidentally load more than you want
My dictionary app used RandomAccessFile to access about 45 MB of data back when each Android app could only use 16 MB of RAM. A service running my dictionary engine on the same 45 MB of data uses only about 2 MB of RAM (and most of that is probably used by the Dalvik VM rather than my search engine). So this class definitely works as intended.
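
As a rough illustration of the points above, the following sketch seeks to a byte offset, reads one line from there, and appends to the end of the file. The class name, helper names and file contents are made up for the example, and the text is assumed to be ASCII.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

class SeekAndRead {
    // Jumps straight to a byte offset and reads one line from there.
    static String readLineAt(File file, long offset) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file, "r")) {
            raf.seek(offset);
            return raf.readLine(); // maps each byte to one char, fine for ASCII
        }
    }

    // Appends text by positioning the file pointer at the current end.
    static void append(File file, String text) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file, "rw")) {
            raf.seek(raf.length());
            raf.write(text.getBytes(StandardCharsets.US_ASCII));
        }
    }

    public static void main(String[] args) throws IOException {
        File file = File.createTempFile("seek-demo", ".txt");
        Files.write(file.toPath(), "alpha\nbeta\ngamma\n".getBytes(StandardCharsets.US_ASCII));
        append(file, "delta\n");
        System.out.println(readLineAt(file, 6)); // offset 6 is just past "alpha\n"
        file.delete();
    }
}
```

Note that RandomAccessFile does no buffering of its own, so readLine() reads byte by byte; for long sequential reads it is usually worth reading larger chunks into a byte array with read(byte[]) after the seek.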

You could try using a memory mapped file (java.nio.channels.FileChannel.map()). I'm not sure how much heap space would be allocated for this though.
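
A minimal sketch of the memory-mapped approach, assuming an ASCII file and a region that lies entirely inside it (the class and method names are illustrative). The mapped region itself lives outside the Java heap and is paged in by the OS on demand; only the small byte array copied out of it is a heap allocation.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class MappedRead {
    // Maps only the requested region of the file and copies it into a String.
    static String readRegion(Path path, long offset, int length) throws IOException {
        try (FileChannel channel = FileChannel.open(path, StandardOpenOption.READ)) {
            MappedByteBuffer region =
                    channel.map(FileChannel.MapMode.READ_ONLY, offset, length);
            byte[] bytes = new byte[length];
            region.get(bytes); // copies the mapped bytes onto the heap
            return new String(bytes, StandardCharsets.US_ASCII);
        }
    }

    public static void main(String[] args) throws IOException {
        Path path = Files.createTempFile("mapped-demo", ".txt");
        Files.write(path, "0123456789".getBytes(StandardCharsets.US_ASCII));
        System.out.println(readRegion(path, 3, 4));
    }
}
```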

Avoid obtaining same InputStream more than once

I can see there are a number of posts regarding reusing an InputStream. I understand an InputStream is a one-time thing and cannot be reused.
However, I have a use case like this:
I have downloaded the file from DropBox by obtaining the DropBoxInputStream using DropBox's Java SDK. I then need to upload the file to another system by passing the InputStream. However, as part of the upload, I have to provide the MD5 of the file. So I have to read the file from the stream before uploading the file. Because the DropBoxInputStream I received can only be used once, I have to get another DropBoxInputStream after I have calculated the MD5 and before uploading the file. The procedure is like:
1. Get first DropBoxInputStream
2. Read from the DropBoxInputStream and calculate MD5
3. Get the second DropBoxInputStream
4. Upload file using the MD5 and the second DropBoxInputStream.
I am wondering whether there is any way for me to "cache" or "backup" the InputStream before I calculate the MD5, so that I can skip step 3 of obtaining the same DropBoxInputStream again?
Many thanks
EDIT:
Sorry I missed some information.
What I am currently doing is using an MD5DigestOutputStream to calculate the MD5. I stream the data through the MD5DigestOutputStream and save it locally as a temp file. Once all the data has gone through the MD5DigestOutputStream, it calculates the MD5.
I then call a third party library to upload the file using the calculated MD5 and a FileInputStream which reads from the temp file.
However, this sometimes requires huge disk space and I want to remove the need for a temp file. The library I use only accepts an MD5 and an InputStream. This means I have to calculate the MD5 on my end. My plan is to use my MD5DigestOutputStream to write data to /dev/null (not keeping the file) so that I can calculate the MD5, then get the InputStream from DropBox again and pass that to the library I use. I assume the library will be able to get the file directly from DropBox without the need for me to cache the file either in memory or on disk. Will it work?

Input streams aren't really designed to be copied or re-used; they exist specifically for situations where you don't want to read everything into a byte array and use array operations on it (this is especially useful when the whole array isn't available yet, as in socket communication, for example). You could buffer into a byte array, that is, read sections of the stream into a byte array until you have enough data.
But that's unnecessary for calculating an MD5. Notice that InputStream is abstract, so it needs to be implemented in a subclass. It has many implementations: GZIPInputStream, FileInputStream, etc. Many of these are, in design-pattern speak, decorators of the IO stream: they wrap another stream and add extra functionality to the abstract base IO classes. For example, GZIPInputStream decompresses gzip data as it is read from the underlying stream.
So, what you need is a stream that does this for MD5. There is, joyfully, a well documented similar thing: see this answer. So you should just be able to pass your DropBox input stream (as it will itself be an input stream) to create a new DigestInputStream, and then you can both take the MD5 and continue to read as before.
Worried about type casting? The idea with decorators in Java is that, since the InputStream base class declares all the methods and 'beef' you need to do your IO, there's no harm in passing instances of objects inheriting from InputStream to the constructor of each stream implementation, and you can still do the same core IO.
Finally, I should probably answer your actual question: say you still want to "cache" or "backup" the stream anyway? Well, you could just write it to a byte array. This is well documented, but can become a faff when your streams get more complicated. Alternatively, try looking at a PushbackInputStream. Here, you can easily write a function to read off n bytes, perform an operation on them, and then restore them to the stream. It is generally good to avoid these stream implementations in Java, as they're bad for memory use, but they're no worse than buffering everything up, which you'd otherwise have to do.
Or, of course, I would have a go with DigestInputStream.
Hope this helps,
Best.
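
A minimal sketch of the DigestInputStream idea, using an in-memory stream as a stand-in for the DropBox stream and a generic OutputStream as the copy target (both are assumptions of this example, not the real SDK or upload library). One caveat: the digest is only complete once the stream has been read to the end, so this does not by itself help if the upload API needs the MD5 before it starts reading.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class DigestWhileCopying {
    // Copies source to target in small chunks; the MD5 is updated as a side
    // effect of every read, so the data is only passed over once.
    static String copyAndMd5(InputStream source, OutputStream target)
            throws IOException, NoSuchAlgorithmException {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        try (DigestInputStream in = new DigestInputStream(source, md5)) {
            byte[] chunk = new byte[8192];
            int n;
            while ((n = in.read(chunk)) != -1) {
                target.write(chunk, 0, n);
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md5.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        InputStream source =
                new ByteArrayInputStream("hello".getBytes(StandardCharsets.US_ASCII));
        ByteArrayOutputStream target = new ByteArrayOutputStream();
        System.out.println(copyAndMd5(source, target));
    }
}
```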

You don't need to open a new InputStream from DropBox.
Once you have read the file from DropBox, you have it locally. So it is either in memory (in a byte array) or you stored it in a local file. Now you can create an InputStream that reads the data from memory (ByteArrayInputStream) or disk (FileInputStream) in order to upload the file.
So instead of caching the InputStream (which you can't) you cache the contents (which you can).

Is it advisable to store large strings in memory, or repeatedly read a file?

Let's say I have various text/json/xml/whatever files (stored locally, in the assets directory), ranging in size from 20 - 500 KB. Assuming these files are going to be referenced frequently, throughout the application, is it better to:
A) Read the file once, the first time it's requested, and store the data in a variable
or
B) Read the file each time it's requested, grab the requested bit of information, and allow GC to clean up afterward?
Coming from web-dev, I generally use option (A), but I wonder if the storage limitation of mobile devices makes B preferred in this context (Android app development).
TYIA.

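You can keep your data in memory in compressed form, which reduces your memory footprint at any point in time. This technique applies to both PCs and mobile phones. Read the file once, then compress it and store it in memory; later, when you need the data, read and decompress it. The following example uses GZIPOutputStream to compress a string.
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

public static String compress(String str) throws IOException {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    GZIPOutputStream gzip = new GZIPOutputStream(out);
    gzip.write(str.getBytes(StandardCharsets.UTF_8)); // explicit charset instead of the platform default
    gzip.close(); // finishes the gzip stream so all compressed bytes reach 'out'
    // ISO-8859-1 maps every byte to exactly one char, so the bytes survive the round trip
    return out.toString("ISO-8859-1");
}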
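
The answer shows only the compression half. Below is a sketch of the full round trip; as an assumption of this sketch (not part of the answer), the compressed form is kept as a byte[] rather than a String, since a byte array is the more natural container for binary data.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class GzipCache {
    // Compresses text into a gzip byte array suitable for keeping in memory.
    static byte[] compress(String text) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(out)) {
            gzip.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return out.toByteArray();
    }

    // Restores the original text from the compressed bytes.
    static String decompress(byte[] compressed) throws IOException {
        try (GZIPInputStream gzip =
                     new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            int n;
            while ((n = gzip.read(chunk)) != -1) {
                out.write(chunk, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        String json = "{\"key\": \"value\"}";
        byte[] packed = compress(json);
        System.out.println(decompress(packed).equals(json));
    }
}
```

Keep in mind that every lookup pays the decompression cost and temporarily allocates the full uncompressed text, so this trades CPU time and short-lived garbage for a smaller steady-state footprint.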

If the file is requested frequently, it's definitely better to read the file once and store it in a cache.
You can also read the article titled How Google Taught me to Cache and Cash In on the High Scalability website.

That depends on the total size of the files, the access frequency, and your target customers. Although high-end phones have a lot of memory, there are many low-end devices which have much less.
It may be worth using an LRU cache to strike a balance.
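
A small LRU cache can be sketched on top of LinkedHashMap in access order, as below. The class name and the entry limit are illustrative; on Android, the platform's android.util.LruCache class serves the same purpose and can be sized by bytes rather than by entry count.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    LruCache(int maxEntries) {
        super(16, 0.75f, true); // true = iterate in access order, not insertion order
        this.maxEntries = maxEntries;
    }

    // Called by LinkedHashMap after each put; returning true evicts the
    // least recently used entry.
    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries;
    }

    public static void main(String[] args) {
        LruCache<String, String> cache = new LruCache<>(2);
        cache.put("a.json", "contents of a");
        cache.put("b.json", "contents of b");
        cache.get("a.json");                  // a.json is now the most recently used
        cache.put("c.json", "contents of c"); // evicts b.json
        System.out.println(cache.keySet());
    }
}
```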

Storing state in Java

Broad discussion question.
Are there any libraries already which allow me to store the state of execution of my application in Java?
E.g. I have an application which processes files, and the application may be forced to shut down suddenly at some point. I want to store information on which files have been processed and which have not, and what stage the processing had reached for the ongoing ones.
Are there already any libraries which abstract this functionality or I would have to implement it from scratch?

It seems like what you are looking for is serialization which can be performed with the Java Serialization API.
You can write even less code if you decide to use a well-known library such as Apache Commons Lang and its SerializationUtils class, which itself is built on top of the Java Serialization API.
Using the latter, serializing/deserializing your application state to and from a file is done in a few lines.
The only thing you have to do is create a class holding your application state, let's call it... ApplicationState :-) It can look like this:
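import java.io.Serializable;
import java.util.List;

// Must implement Serializable, otherwise Java serialization rejects it
class ApplicationState implements Serializable {
    enum ProcessState {
        READ_DONE,
        PROCESSING_STARTED,
        PROCESSING_ENDED,
        ANOTHER_STATE;
    }
    private List<String> filesDone, filesToDo;
    private String currentlyProcessingFile;
    private ProcessState currentProcessState;
}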
With such a structure, and using SerializationUtils, serializing is done the following way:
try {
    ApplicationState state = new ApplicationState();
    ...
    // File to serialize the object to
    String fileName = "applicationState.ser";
    // New file output stream for the file
    FileOutputStream fos = new FileOutputStream(fileName);
    // Serialize the state object
    SerializationUtils.serialize(state, fos);
    fos.close();
    // Open FileInputStream to the file
    FileInputStream fis = new FileInputStream(fileName);
    // Deserialize and cast back into ApplicationState
    ApplicationState restored = (ApplicationState) SerializationUtils.deserialize(fis);
    System.out.println(restored);
    fis.close();
} catch (Exception e) {
    e.printStackTrace();
}

It sounds like the Java Preferences API might be a good option for you. This can store user/system settings with minimal effort on your part and you can update/retrieve at any time.
https://docs.oracle.com/javase/8/docs/technotes/guides/preferences/index.html

It's pretty simple to make from scratch. You could follow this:
Have a DB (or just a file) that stores the information of processing progress. Something like:
Id|fileName|status|metadata
As soon as you start processing a file, make an entry in this table and mark its status as PROCESSING. Then you can store intermediate states, and finally, when you're done, you can set the status to DONE. This way, on restart, you know which files were processed and which files were in progress when the process shut down or crashed. And (obviously) where to start.
In large enterprise environments where applications are loosely coupled (and there is no guarantee that an application will be available or won't crash), we use a message queue to do something similar to ensure a reliable architecture.
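
The file-based variant of this idea can be sketched as an append-only log, where the last status recorded for a file wins on recovery. This is only one possible shape, with a simplified "fileName|status" line format and hypothetical class and method names; it leaves out the Id and metadata columns.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.LinkedHashMap;
import java.util.Map;

class ProgressLog {
    private final Path log;

    ProgressLog(Path log) {
        this.log = log;
    }

    // Appends one "fileName|status" line; earlier lines are never rewritten.
    void mark(String fileName, String status) throws IOException {
        String line = fileName + "|" + status + System.lineSeparator();
        Files.write(log, line.getBytes(StandardCharsets.UTF_8),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    // Replays the log after a restart; the last status per file wins.
    Map<String, String> recover() throws IOException {
        Map<String, String> latest = new LinkedHashMap<>();
        if (!Files.exists(log)) {
            return latest;
        }
        for (String line : Files.readAllLines(log, StandardCharsets.UTF_8)) {
            int sep = line.lastIndexOf('|');
            if (sep > 0) {
                latest.put(line.substring(0, sep), line.substring(sep + 1));
            }
        }
        return latest;
    }

    public static void main(String[] args) throws IOException {
        ProgressLog progress = new ProgressLog(Files.createTempFile("progress", ".log"));
        progress.mark("a.csv", "PROCESSING");
        progress.mark("a.csv", "DONE");
        progress.mark("b.csv", "PROCESSING");
        System.out.println(progress.recover());
    }
}
```

On restart, any file whose latest status is still PROCESSING was interrupted and needs to be picked up again.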

There are almost too many ways to mention. I would choose the option you believe is simplest.
You can use:
- a file to record what is done (and what is to be done)
- a persistent queue on JMS (which supports multiple processes, even on different machines)
- an embedded or remote database.
An approach I rave about is using memory mapped files. A nice feature is that information is not lost if the application dies or is killed (provided the OS doesn't crash) which means you don't have to flush it, nor worry about losing data if you don't.
This works because the data is partly managed by the OS which means it uses little heap (even for TB of data) and the OS deals with loading and flushing to disk making it much faster (and making sizes much larger than your main memory practical).
BTW: This approach works even with a kill -9 as the OS flushes the data to disk. To test this I use Unsafe.getByte(0) which crashes the application with a SEG fault immediately after making a change (as in the next machine code instruction) and it still writes the change to disk.
This won't work if you pull the power, but you have to be really quick. You can use memory mapped files to force the data to disk before continuing, but I don't know how you can test this really works. ;)
I have a library which could make memory mapped files easier to use
https://github.com/peter-lawrey/Java-Chronicle
It's not a long read and you can use it as an example.
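
For a feel of the plain JDK mechanism underneath, here is a minimal sketch that keeps a single long (say, the index of the last processed file) in a memory-mapped file. It uses only java.nio, not the library linked above, and the class and method names are made up for the example. The force() call is the "force the data to disk" step mentioned in the answer.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class MappedCounter {
    // Stores a long at offset 0 of the mapped file.
    static void write(Path path, long value) throws IOException {
        try (FileChannel channel = FileChannel.open(path, StandardOpenOption.CREATE,
                StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            MappedByteBuffer buffer =
                    channel.map(FileChannel.MapMode.READ_WRITE, 0, Long.BYTES);
            buffer.putLong(0, value);
            buffer.force(); // optional: ask the OS to flush the change to storage now
        }
    }

    // Reads the long back, for example after a restart.
    static long read(Path path) throws IOException {
        try (FileChannel channel = FileChannel.open(path, StandardOpenOption.READ)) {
            MappedByteBuffer buffer =
                    channel.map(FileChannel.MapMode.READ_ONLY, 0, Long.BYTES);
            return buffer.getLong(0);
        }
    }

    public static void main(String[] args) throws IOException {
        Path path = Files.createTempFile("state", ".bin");
        write(path, 42L);
        System.out.println(read(path));
    }
}
```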

Apache Commons Configuration API: http://commons.apache.org/proper/commons-configuration/userguide/howto_filebased.html#File-based_Configurations

temp files in memory in java program

Is there a way to force the temporary files created by a Java program to be kept in memory? Since I use several large XML files, would this give me an advantage? I need a transparent method that lets me leave the existing application untouched.
UPDATE: I'm looking at the source code and I noticed that it uses libraries (I can not change) which requires the path of those files ...
Thanks

The only way I can think of is to create a RAM disk and then point the system property java.io.tmpdir to that RAM disk.

XML is just a String, so why not just reference Strings in memory? I think the File interface is a distraction. Use StringBuilder if you need to manipulate the data. Use StringBuffer if you need thread safety. Put them in a type-safe Map if you have a variable number of things that need to be looked up by key.
If you absolutely have to keep the File interface, then create an InMemoryFileWriter that wraps ByteArrayOutputStream and ByteArrayInputStream to keep them in memory. But again, I think the whole file-in-memory thing is a bad decision if you just want to cache things in memory; that is a lot of overhead when a simple String would do.

Don't use files if you don't have to. Consider com.google.common.io.FileBackedOutputStream from Guava:
An OutputStream that starts buffering to a byte array, but switches to file buffering once the data reaches a configurable size.

You can probably override the default behaviour of java.io.File with some reflection magic, but I'm sure you don't want to do that as it can lead to unpredictable behaviour. You're better off providing a mechanism that makes it possible to switch between the usual and in-memory behaviour, and routing all calls through this mechanism.
Look at this example, it shows how to use file API to create in-memory files.

Assuming you have control over the streams that are being used to write to the file -
Do you absolutely want the in-memory behavior? If all that you want to do is reduce the number of system calls to write to the disk, you can wrap the FileOutputStream in a BufferedOutputStream (with appropriately big buffer size) and write to this BufferedOutputStream (or BufferedWriter) instead of writing directly to the original FileOutputStream.
(This does require a change in the existing application)
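
A sketch of the buffered variant: many small writes go into an in-memory buffer and reach the disk in large blocks. The class name, the 64 KB buffer size and the sample content are arbitrary choices for illustration.

```java
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

class BufferedFileWrite {
    // Writes 'count' small XML fragments through a 64 KB buffer.
    static void writeItems(File file, int count) throws IOException {
        try (OutputStream out =
                     new BufferedOutputStream(new FileOutputStream(file), 64 * 1024)) {
            for (int i = 0; i < count; i++) {
                String item = "<item>" + i + "</item>\n";
                out.write(item.getBytes(StandardCharsets.US_ASCII));
            }
        } // closing the stream flushes whatever is still in the buffer
    }

    public static void main(String[] args) throws IOException {
        File file = File.createTempFile("buffered-demo", ".xml");
        writeItems(file, 1000);
        System.out.println(file.length() + " bytes written");
        file.delete();
    }
}
```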
