Reading an input stream twice without storing it in memory - java

According to a Stack Overflow question, an InputStream can be read multiple times by using the mark() and reset() methods provided by InputStream, or by using PushbackInputStream.
In all these cases the content of the stream is stored in a byte array (i.e., the original content of the file is held in main memory) and reused multiple times.
What happens when the size of the file exceeds the memory size? I think this could lead to an OutOfMemoryError.
Is there a better way to read the stream content multiple times without storing it locally (i.e., in main memory)?
Please help me understand this. Thanks in advance.

It depends on the source of the stream.
If it's a local file, you can likely re-open and re-read the stream as many times as you want.
If it's dynamically generated by a process, a remote service, etc., you might not be free to re-generate it. In that case, you need to store it, either in memory or in some more persistent (and slower) storage like a file system or storage service.
Maybe an analogy would help. Suppose your friend is speaking to you at length. You listen carefully without interruption, but when they are done, you realize you didn't understand something they said near the beginning, and want to review that portion.
At this point, there are a few possibilities.
Perhaps your friend was actually reading aloud from a book. You can simply re-read the book.
Or, perhaps you had the foresight to record their monologue. You can replay the recording.
However, since neither you nor your friend has perfect and unlimited recall, simply repeating verbatim what was said ten minutes ago from memory alone is not an option.
An InputStream is like your friend speaking. Neither of you has a good enough memory to remember exactly, word-for-word, what is said. In the same way, neither a process that is generating the data stream nor your program has enough RAM to store, byte-for-byte, the stream. To scale, your program has to rely on its "short-term memory" (RAM), working with just a small portion of the whole stream at any given time, and "taking notes" (writing to a persistent store) as it encounters important points.
If the source of the stream is a local file, then it's like your friend reading a book. Either of you can re-read that content easily enough.
If you copy the stream to some persistent storage, that's like recording your friend's speech. You can replay it as often as you like.
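As a rough illustration of the "recording" option, here is a minimal sketch that spools a one-shot stream to a temporary file and then hands out fresh streams over that file. The class name SpooledStream and its methods are made up for this example; only the java.nio.file calls are standard.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical helper: spools a one-shot stream to disk so it can be re-read. */
class SpooledStream implements AutoCloseable {
    private final Path tempFile;

    SpooledStream(InputStream source) throws IOException {
        tempFile = Files.createTempFile("spool-", ".tmp");
        // Files.copy moves the data through a small internal buffer,
        // so the whole content is never held in memory at once.
        Files.copy(source, tempFile, StandardCopyOption.REPLACE_EXISTING);
    }

    /** Each call opens an independent stream positioned at the start of the data. */
    InputStream open() throws IOException {
        return Files.newInputStream(tempFile);
    }

    /** Deletes the temporary file; streams opened earlier should be closed first. */
    @Override
    public void close() throws IOException {
        Files.deleteIfExists(tempFile);
    }
}
```

The cost is one extra pass of disk I/O, but memory use stays constant no matter how large the stream is.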
Consider a scenario where a browser is uploading a large file, but the server is busy and not able to read that stream for some time. Where is that data stored during that delay?
Because the receiver can't always respond immediately to input, TCP and many other protocols allocate a small buffer to store some data from a sender. But they also have a way to tell a sender that is transmitting too fast to wait: flow control. Going back to the analogy, it's like telling your friend to pause a moment while you catch up with your note-taking.
As the browser uploads the file, at first, the buffer will be filled. But if the server can't keep up, the browser will be instructed to pause its upload until there is more room in the buffer. (This generally happens at the OS and TCP level; the client and server applications don't manage this directly.) The upload speed depends on how fast the browser can read the file from disk, how fast the network link is, and how fast the server can process the uploaded data. Even a fast network and client will be limited by the weak link in this chain.

Related

How is InputStream managed in memory?

I am familiar with the concept of InputStream, buffers, and why they are useful (when you need to work with data that might be larger than the machine's RAM, for example).
I was wondering, though, how does the InputStream actually carry all that data? Could an OutOfMemoryError be caused if there is TOO much data being transferred?
Case-scenario
If I connect from a client to a server, requesting a 100GB file, the server starts iterating through the bytes of the file with a buffer, and writing the bytes back to the client with outputStream.write(byte[]). The client is not ready to read the InputStream right now, for whatever reason. Will the server continue sending the bytes of the file indefinitely? And if so, won't the OutputStream/InputStream be larger than the RAM of one of these machines?
InputStream and OutputStream implementations do not generally use a lot of memory. In fact, the word "Stream" in these types means that it does not need to hold the data, because it is accessed in a sequential manner -- in the same way that a stream can transfer water between a lake and the ocean without holding a lot of water itself.
But "stream" is not the best word to describe this. It's more like a pipe, because when you transfer data from a server to a client, every stage transfers back-pressure from the client that controls the rate at which data gets sent. This is similar to how your faucet controls the rate of flow through your pipes all the way to the city reservoir:
As the client reads data, its InputStream only requests more data from the OS when its internal (small) buffers are empty. Each request allows only a limited amount of data to be transferred.
As data is requested from the OS, its own internal buffer empties, and it notifies the server about how much space there is for new data. The server can send only this much (that's called 'flow control' in TCP: https://en.wikipedia.org/wiki/Transmission_Control_Protocol#Resource_usage).
On the server side, the server-side OS sends out data from its own internal buffer when the client has space to receive it. As its own internal buffer empties, it allows the writing process to re-fill it with more data.
As the server-side process write()s to its OutputStream, the OutputStream will try to write data to the OS. When the OS buffer is full, it will make the server process wait until the server-side buffer has space to accept new data.
Notice that a slow client can make the server process take a very long time. If you're writing a server, and you don't control the clients, then it's very important to consider this and to ensure that there are not a lot of server-side resources tied up while a long data transfer takes place.
Your question is as interesting as it is difficult to answer properly.
First: InputStream and OutputStream are not a storage means, but an access means: They describe that the data shall be accessed in sequential, unidirectional order, but not how it shall be stored. The actual way of storing the data is implementation-dependent.
So, could there be an InputStream that stores the whole amount of data simultaneously in memory? Yes, there could be, though it would be an appalling implementation. The most common and sensible implementation of InputStreams / OutputStreams stores just a fixed, small amount of data in a temporary buffer of 4K-8K, for example.
(So far, I supposed you already knew that, but it was necessary to tell.)
Second: What about connected writing / reading streams between a server and a client? In a common scenario of buffered writing, the server will not write more data than the buffer allows. So, if the server starts writing, and the client then goes down (for whatever reason), the server will just keep writing until the buffer is full, and then mark it as ready for reading. Until that read is completed (by the client peer), the server won't fill the buffer again. Remember: This kind of read/write is blocking: The client blocks until there is a buffer ready to be read, and the server (or, at least, the server thread bound to this connection) blocks until the last read is completed.
How long will the server block? Typically, a server should have a safety timeout to ensure that long blocks will break the connection, thus releasing the blocked thread. The client should have the same.
The timeouts set for the connection depend on the implementation, and the protocol.
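For plain java.net sockets, a read timeout of this kind can be set with Socket.setSoTimeout. The sketch below is only an illustration: the class TimedRead, the method readOrTimeout, and the TIMED_OUT marker value are invented for the example, and a real server would probably close the connection on timeout rather than return a code.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.Socket;
import java.net.SocketTimeoutException;

class TimedRead {
    /** Hypothetical marker value meaning "no data arrived in time". */
    static final int TIMED_OUT = -2;

    /**
     * Reads up to buf.length bytes from the socket, giving up after timeoutMillis.
     * Returns the number of bytes read, -1 at end of stream, or TIMED_OUT.
     */
    static int readOrTimeout(Socket socket, byte[] buf, int timeoutMillis) throws IOException {
        // With SO_TIMEOUT set, a blocking read() throws SocketTimeoutException
        // instead of waiting forever; the socket itself stays usable.
        socket.setSoTimeout(timeoutMillis);
        InputStream in = socket.getInputStream();
        try {
            return in.read(buf);
        } catch (SocketTimeoutException e) {
            return TIMED_OUT;
        }
    }
}
```

Note that this only bounds blocking reads. A write that blocks on a full send buffer is not covered by SO_TIMEOUT and needs a different mechanism.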
No, it does not need to hold all data. It just advances forward in the file (usually using buffered data). The stream can discard old buffers as it pleases.
Note that there are a lot of very different implementations of input streams, so the exact behaviour varies a lot.
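To make the "fixed small buffer" point concrete, here is a minimal sketch that processes a stream of any length while holding only 8 KB of it at a time. The class and method names are hypothetical, and CRC32 is just a stand-in for whatever per-chunk work a real program would do.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.CRC32;

class ConstantMemoryRead {
    /**
     * Consumes the whole stream through one reusable 8 KB buffer.
     * Returns { total bytes read, CRC32 of the content }.
     */
    static long[] countAndChecksum(InputStream in) throws IOException {
        byte[] buffer = new byte[8192]; // the only data held in memory
        CRC32 crc = new CRC32();
        long total = 0;
        int n;
        while ((n = in.read(buffer)) != -1) {
            crc.update(buffer, 0, n); // handle this chunk, then overwrite it
            total += n;
        }
        return new long[] { total, crc.getValue() };
    }
}
```

Memory use is the same for a 1 KB stream and a 100 GB one; only the running time changes.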

how to optimize ZipOutputStream to use less ram memory [duplicate]

I am building a java server that needs to scale. One of the servlets will be serving images stored in Amazon S3.
Recently under load, I ran out of memory in my VM and it was after I added the code to serve the images so I'm pretty sure that streaming larger servlet responses is causing my troubles.
My question is: is there any best practice for coding a Java servlet to stream a large (>200k) response back to a browser when it is read from a database or other cloud storage?
I've considered writing the file to a local temp drive and then spawning another thread to handle the streaming so that the tomcat servlet thread can be re-used. This seems like it would be io heavy.
Any thoughts would be appreciated. Thanks.
When possible, you should not store the entire contents of a file to be served in memory. Instead, acquire an InputStream for the data, and copy the data to the Servlet OutputStream in pieces. For example:
InputStream in = null;
ServletOutputStream out = null;
try {
    in = [ code to get source input stream ];
    String mimeType = [ code to get mimetype of data to be served ];
    response.setContentType(mimeType);
    out = response.getOutputStream();
    // Only FILEBUFFERSIZE bytes are held in memory at any time.
    byte[] bytes = new byte[FILEBUFFERSIZE];
    int bytesRead;
    while ((bytesRead = in.read(bytes)) != -1) {
        out.write(bytes, 0, bytesRead);
    }
} finally {
    // Close both streams even if the copy fails.
    if (in != null) in.close();
    if (out != null) out.close();
}
I do agree with toby, you should instead "point them to the S3 url."
As for the OOM exception, are you sure it has to do with serving the image data? Let's say your JVM has 256MB of "extra" memory to use for serving image data. With Google's help, "256MB / 200KB" = 1310. For 2GB "extra" memory (these days a very reasonable amount) over 10,000 simultaneous clients could be supported. Even so, 1300 simultaneous clients is a pretty large number. Is this the type of load you experienced? If not, you may need to look elsewhere for the cause of the OOM exception.
Edit - Regarding:
In this use case the images can contain sensitive data...
When I read through the S3 documentation a few weeks ago, I noticed that you can generate time-expiring keys that can be attached to S3 URLs. So, you would not have to open up the files on S3 to the public. My understanding of the technique is:
1. Initial HTML page has download links to your webapp
2. User clicks on a download link
3. Your webapp generates an S3 URL that includes a key that expires in, let's say, 5 minutes.
4. Send an HTTP redirect to the client with the URL from step 3.
5. The user downloads the file from S3. This works even if the download takes more than 5 minutes - once a download starts it can continue through completion.
Why wouldn't you just point them to the S3 url? Taking an artifact from S3 and then streaming it through your own server to me defeats the purpose of using S3, which is to offload the bandwidth and processing of serving the images to Amazon.
I've seen a lot of code like john-vasilef's (currently accepted) answer, a tight while loop reading chunks from one stream and writing them to the other stream.
The argument I'd make is against needless code duplication, in favor of using Apache's IOUtils. If you are already using it elsewhere, or if another library or framework you're using is already depending on it, it's a single line that is known and well-tested.
In the following code, I'm streaming an object from Amazon S3 to the client in a servlet.
import java.io.InputStream;
import java.io.OutputStream;
import org.apache.commons.io.IOUtils;
InputStream in = null;
OutputStream out = null;
try {
    in = object.getObjectContent();
    out = response.getOutputStream();
    IOUtils.copy(in, out);
} finally {
    IOUtils.closeQuietly(in);
    IOUtils.closeQuietly(out);
}
6 lines of a well-defined pattern with proper stream closing seems pretty solid.
toby is right, you should be pointing straight to S3, if you can. If you cannot, the question is a little vague to give an accurate response:
How big is your java heap? How many streams are open concurrently when you run out of memory?
How big is your read/write buffer (8K is good)?
You are reading 8K from the stream, then writing 8K to the output, right? You are not trying to read the whole image from S3, buffer it in memory, and then send the whole thing at once?
If you use 8K buffers, you could have 1000 concurrent streams going in ~8 MB of heap space, so you are definitely doing something wrong....
BTW, I did not pick 8K out of thin air; it is a common default size for socket buffers. Send more data at once, say 1 MB, and you will be blocking on the TCP/IP stack while holding a large amount of memory.
I agree strongly with both toby and John Vasileff: S3 is great for offloading large media objects if you can tolerate the associated issues. (An instance of our own app does that for 10-1000MB FLVs and MP4s.) For example, there are no partial requests (byte range header), so one has to handle that 'manually'; there is occasional downtime; etc.
If that is not an option, John's code looks good. I have found that a byte buffer of 2k FILEBUFFERSIZE is the most efficient in microbenchmarks. Another option might be a shared FileChannel. (FileChannels are thread-safe.)
That said, I'd also add that guessing at what caused an out of memory error is a classic optimization mistake. You would improve your chances of success by working with hard metrics.
Place -XX:+HeapDumpOnOutOfMemoryError into your JVM startup parameters, just in case
Use jmap on the running JVM (jmap -histo <pid>) under load
Analyze the metrics (jmap -histo output, or have jhat look at your heap dump). It very well may be that your out of memory error is coming from somewhere unexpected.
There are of course other tools out there, but jmap & jhat come with Java 5+ 'out of the box'
I've considered writing the file to a local temp drive and then spawning another thread to handle the streaming so that the tomcat servlet thread can be re-used. This seems like it would be io heavy.
Ah, I don't think you can do that. And even if you could, it sounds dubious. The tomcat thread that is managing the connection needs to be in control. If you are experiencing thread starvation then increase the number of available threads in ./conf/server.xml. Again, metrics are the way to detect this--don't just guess.
Question: Are you also running on EC2? What are your tomcat's JVM start up parameters?
You have to check two things:
Are you closing the stream? Very important
Maybe you're giving out stream connections "for free". One stream is not large, but many streams open at the same time can consume all your memory. Create a pool so that you cannot have more than a certain number of streams running at the same time.
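One simple way to cap the number of concurrent streams is a counting semaphore. This is a minimal sketch, not a full pool: LimitedStreamer and tryStream are hypothetical names, and rejecting immediately when the limit is reached is just one possible policy (waiting with a timeout would be another).

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.concurrent.Semaphore;

class LimitedStreamer {
    private final Semaphore permits;

    LimitedStreamer(int maxConcurrentStreams) {
        permits = new Semaphore(maxConcurrentStreams);
    }

    /**
     * Copies in to out if a permit is free.
     * Returns false without copying anything when the limit is already reached.
     */
    boolean tryStream(InputStream in, OutputStream out) throws IOException {
        if (!permits.tryAcquire()) {
            return false; // caller could answer with HTTP 503, for example
        }
        try {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return true;
        } finally {
            permits.release(); // always give the permit back, even on failure
        }
    }

    int availablePermits() {
        return permits.availablePermits();
    }
}
```

With this in place, worst-case buffer memory is bounded by the permit count times the buffer size, regardless of how many requests arrive.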
In addition to what John suggested, you should repeatedly flush the output stream. Depending on your web container, it is possible that it caches parts or even all of your output and flushes it all at once (for example, to calculate the Content-Length header). That would burn quite a bit of memory.
If you can structure your files so that the static files are separate and in their own bucket, the fastest performance today can likely be achieved by using the Amazon S3 CDN, CloudFront.

HttpClient - write from one InputStream to multiple POST requests

I have a cluster of servers (potentially remote from each other) which all run Tomcat and communicate over HTTP using Apache HttpClient. A large number of these servers are data stores, and one of the servers is a front-facing webserver that serves as an intermediary between the client and the stores. A user should be able to upload a file to the webserver and the webserver will pass that file to a given number of stores.
So, the question: is it possible to take the file part of the upload from the client as an InputStream and write to multiple POST requests to the stores at the same time? If I were simply writing to local files, the obvious solution would simply be to read chunks of the InputStream into a byte array buffer and write from the buffer to each of the outputs in turn, but I'm at a loss as to how to convince HttpClient to "share" a stream like this.
And yes, I could simply read the entire InputStream into an object on the webserver and write it out to each store sequentially, but since I could potentially be accepting very large files I'd have to write the data to disk and then read it back for each store server, and the number of disk operations could swiftly become prohibitive. This is an implementation I'd prefer to avoid.
If the stores do not have the network bandwidth to keep up, how would it "share" the stream?
You can split up the incoming file and pass it on to the stores without writing it to disk, but if just one of the stores cannot keep up, you'll have to keep that file data in memory until it can accept it. If it's a big file, or many users, it can potentially take all your memory.
More technically what I mean is that you can create 5 threads that will send data as fast as they can to the stores and keep the file data in a shared FIFO structure. When the last thread has accessed a portion and sent that portion, that data can be removed from the data structure, but not before. If one is slow, the data structure can grow huge.
The data has to be somewhere, if not memory and not hard drive, then where?
So, keep the incoming data in memory until (if?) you're running out of memory (never?), then flush it to the hard drive. Keep trying to empty the data structure with the data by getting it sent to the stores and then removing.
You can rather easily code an ExecutorService to handle the re-transmit of data and cleaning up the data structure, but it won't solve the problem magically. :)
I haven't provided source code, because you don't seem to want this solution. You might need help implementing it later if you accept that you can't magically pass the data on without there being some chance of having to buffer it on the hard drive (or a worse solution would be to throttle the user uploads to MinimumBandwidth(store1, store2, store3, store4, store5)).
Edit/changing:
I'm not sure you really want an ExecutorService even though I said that. I would create my own custom Threads to handle this actually. I would create a Collection from the concurrent package, probably a LinkedBlockingQueue that holds byte arrays (not bytes, arrays of bytes). Then I would create a map from Thread->Integer that holds the current index for each thread's progress in passing on the data. When all progress numbers are above, say, 10 (meaning all threads have sent the first 10 chunks), then I remove the first 10 byte arrays, and subtract 10 from all the threads' progress to reset it.
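Here is a rough sketch of the sender-thread idea, with one deliberate change from the design described above: instead of a single shared queue plus a progress map, each store gets its own bounded queue. That keeps memory capped, but it means the upload is throttled to the slowest store (the trade-off mentioned earlier) rather than spilling to disk. All names are hypothetical, plain OutputStreams stand in for the HTTP POST bodies, and error reporting is left out.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

class QueueFanOut {
    private static final byte[] END = new byte[0]; // sentinel, compared by identity

    static void fanOut(InputStream source, List<OutputStream> stores,
                       int chunkSize, int maxQueuedChunks)
            throws IOException, InterruptedException {
        List<BlockingQueue<byte[]>> queues = new ArrayList<>();
        List<Thread> senders = new ArrayList<>();
        for (OutputStream store : stores) {
            BlockingQueue<byte[]> queue = new ArrayBlockingQueue<>(maxQueuedChunks);
            Thread sender = new Thread(() -> {
                boolean failed = false;
                try {
                    for (byte[] chunk = queue.take(); chunk != END; chunk = queue.take()) {
                        if (failed) continue; // keep draining so the reader never blocks forever
                        try {
                            store.write(chunk);
                        } catch (IOException e) {
                            failed = true; // a real implementation would report this
                        }
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            sender.start();
            queues.add(queue);
            senders.add(sender);
        }
        try {
            byte[] buf = new byte[chunkSize];
            int n;
            while ((n = source.read(buf)) != -1) {
                byte[] chunk = Arrays.copyOf(buf, n); // one copy, shared read-only by all senders
                for (BlockingQueue<byte[]> q : queues) {
                    q.put(chunk); // blocks while this store's queue is full
                }
            }
        } finally {
            for (BlockingQueue<byte[]> q : queues) q.put(END);
        }
        for (Thread t : senders) t.join();
    }
}
```

Replacing the bounded queues with unbounded ones gives the "grow until memory runs out" behaviour discussed above.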
Create your own output stream. Attach as many HTTP POST clients to this stream as you need. When your output stream receives data, send it to each of the connected POST clients.
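A minimal sketch of such a custom output stream follows. FanOutOutputStream is a made-up name, and the targets are assumed to be whatever OutputStreams your HTTP client exposes for the request bodies; how to obtain those from HttpClient is not shown here.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.util.List;

/** Hypothetical stream that forwards every write to all attached targets. */
class FanOutOutputStream extends OutputStream {
    private final List<OutputStream> targets;

    FanOutOutputStream(List<OutputStream> targets) {
        this.targets = List.copyOf(targets);
    }

    @Override
    public void write(int b) throws IOException {
        for (OutputStream t : targets) t.write(b);
    }

    @Override
    public void write(byte[] b, int off, int len) throws IOException {
        // Targets are written one after another, so the slowest one
        // sets the pace for the whole upload.
        for (OutputStream t : targets) t.write(b, off, len);
    }

    @Override
    public void flush() throws IOException {
        for (OutputStream t : targets) t.flush();
    }

    @Override
    public void close() throws IOException {
        IOException first = null;
        for (OutputStream t : targets) {
            try {
                t.close();
            } catch (IOException e) {
                if (first == null) first = e; // remember it, but still close the rest
            }
        }
        if (first != null) throw first;
    }
}
```

Because the writes are sequential and blocking, nothing is buffered beyond the chunk currently being copied, at the price of running at the speed of the slowest store.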

How to handle a large stream that is mostly written to disk in netty (use ChunkedWriteHandler?)?

I have a binary protocol with some unknown amount of initial header data (unknown length until header is fully decoded) followed by a stream of data that should be written to disk for later processing.
I have an implementation that decodes the header and then writes the file data to disk as it comes in from one or more frames in ChannelBuffers (so my handler subclasses FrameDecoder, and builds the message step by step, not waiting for one ChannelBuffer to contain the entire message, and writing the file data to disk with each frame). My concern is whether this is enough, or if using something like ChunkedWriteHandler does more than this and is necessary to handle large uploads.
Is there a more optimal way of handling file data than just writing it directly to disk from the ChannelBuffer with each frame?
It should be enough as long as the throughput is good enough. Otherwise, you might want to buffer the received data so that you don't make system calls too often (say, 32KiB bounded buffer) when the amount of received data is not large enough.
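The buffering suggestion can be sketched with plain java.io, independent of Netty. In this hypothetical BufferedDiskWriter, frames are assumed to arrive as byte arrays (in a real handler you would first copy the readable bytes out of the ChannelBuffer), and small frames are collected in a 32 KiB buffer before reaching the file.

```java
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical sink that batches small frames into larger disk writes. */
class BufferedDiskWriter implements AutoCloseable {
    private static final int BUFFER_SIZE = 32 * 1024; // 32 KiB, as suggested above

    private final OutputStream out;

    BufferedDiskWriter(Path file) throws IOException {
        out = new BufferedOutputStream(Files.newOutputStream(file), BUFFER_SIZE);
    }

    /** Called once per received frame; usually just a memory copy into the buffer. */
    void onFrame(byte[] frame, int offset, int length) throws IOException {
        out.write(frame, offset, length);
    }

    /** Flushes whatever is still buffered and closes the file. */
    @Override
    public void close() throws IOException {
        out.close();
    }
}
```

The buffer is bounded, so memory use stays fixed; the write to the underlying file happens only when the buffer fills or the writer is closed.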
Netty could be even faster if it exposed the transferTo/From operation to a user, but such a feature is not available yet.
You should also think about adding an ExecutionHandler in front of your Handler. This will help you to not get blocked by the disk I/O. Otherwise you may see slowdowns on heavy disk access.

Converse of java FileDescriptor .sync() for *reading* files

Reading the javadoc on FileDescriptor's .sync() method, it is apparent that sync() is primarily concerned with committing any modified buffers back to the underlying storage. I.e., making sure that anything that your program has output will actually make it to the disk (or socket or what-have-you, but my question pertains mainly to disks).
But what about the other direction, what about INPUT? Suppose my program has some parts of a java.io.RandomAccessFile buffered in memory, and I want to READ those parts of the file, but perhaps some other process has modified those parts of the file since the last time my program read those blocks?
This is akin to marking a variable as 'volatile' in a C program; something else may have changed the 'real version' of something you merely have a convenient copy of.
I.e., how can you be certain that what your java program reads is at least reasonably up-to-date?
(Clearly the definition of 'up to date' matters. Purely as an example, suppose that the other process, the one that writes to the file, does so on the order of maybe once per second, and suppose that the reading process reads maybe once per minute. In a situation like this, performance isn't a big deal, it's just a matter of making sure that what the reader reads is consistent with what the writer writes, to within say, a second.)
Before re-reading your file, it is usually a good idea to check the last modified timestamp of the file with File.lastModified(). If this timestamp is not newer than the last time you read the file, you don't need to bother with more disk I/O to re-read the blocks you are interested in. One thing to keep in mind though, is that the last modified timestamp may not always be updated immediately when the contents are updated if you are using a network filesystem. If you are dealing with a local process updating the file and another local process running your code reading the file, you most likely won't run into this issue.
One method I've had success with in the past was to have a separate thread poll the file for the last modified timestamp on certain intervals, say 5 seconds. If the file changed, re-process the file and send an event to registered listeners. In my case, 5 seconds was more than soon enough to get updates.
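A minimal sketch of that polling approach is below. ModificationWatcher, hasChanged, and pollEvery are names invented for the example; the listener mechanism is reduced to a single Runnable.

```java
import java.io.File;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

class ModificationWatcher {
    private final File file;
    private long lastSeen;

    ModificationWatcher(File file) {
        this.file = file;
        this.lastSeen = file.lastModified();
    }

    /** Returns true once for each change of the last-modified timestamp. */
    synchronized boolean hasChanged() {
        long current = file.lastModified();
        if (current != lastSeen) {
            lastSeen = current;
            return true;
        }
        return false;
    }

    /** Checks the file on a background thread and runs onChange when it was modified. */
    static ScheduledExecutorService pollEvery(long seconds, ModificationWatcher watcher,
                                              Runnable onChange) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleWithFixedDelay(() -> {
            if (watcher.hasChanged()) {
                onChange.run(); // e.g. re-read the file and notify listeners
            }
        }, seconds, seconds, TimeUnit.SECONDS);
        return scheduler; // caller should call shutdown() when done
    }
}
```

For the 5-second interval mentioned above, the call would be pollEvery(5, watcher, reloadAction), where reloadAction is whatever re-processing your application needs.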
At the moment the file is read into the internal buffer, the contents are consistent with the contents on the disk.
If you want to be sure to have the latest contents on your next access, you also have to go to the disk again, skipping all internal buffers and caches. If you really want to be sure that all such layers are skipped, you'll have to reopen the file from scratch and seek to the position you want to access.
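The reopen-and-seek approach might look like the sketch below. FreshRead and readFresh are hypothetical names. Reopening guarantees that no Java-side buffer from an earlier read is reused; whether the operating system serves the bytes from its own cache is outside the program's control.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Path;

class FreshRead {
    /**
     * Opens the file anew, seeks to position, and reads exactly length bytes.
     * Nothing buffered by a previous call is reused.
     */
    static byte[] readFresh(Path file, long position, int length) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            raf.seek(position);
            byte[] block = new byte[length];
            raf.readFully(block); // throws EOFException if the file is too short
            return block;
        }
    }
}
```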
Of course, your performance will go down the tubes if you access the disk on every possible access of the data. Don't think of 3-5 fold or so but orders of magnitude.
If another program you control is the only one writing to the file, then it's probably best to have 2 threads in the same Java process coordinate. The easiest solution is to create a java.util.concurrent.atomic.AtomicBoolean. The writer thread calls set(true) on the AtomicBoolean and the reader calls getAndSet(false). If getAndSet() returns true, then you know the reader needs to re-read the data. If it's an issue, you could synchronize on some object to prevent the writer from writing while the reader is reading.
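The flag handshake described above is small enough to sketch in full. The class DirtyFlag and its method names are made up for illustration; only the AtomicBoolean calls come from the JDK.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical wrapper around the writer/reader handshake. */
class DirtyFlag {
    private final AtomicBoolean dirty = new AtomicBoolean(false);

    /** Writer thread: call after each write to the file. */
    void markWritten() {
        dirty.set(true);
    }

    /**
     * Reader thread: returns true if the file was written since the last call,
     * and clears the flag in the same atomic step so no update is missed.
     */
    boolean consumeIfDirty() {
        return dirty.getAndSet(false);
    }
}
```

The reader would call consumeIfDirty() before each use of its cached data and re-read the file only when it returns true.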
You said "process" in the question, so maybe you are concerned about any other process on the system changing the data. In this case, I think your best bet is to just reopen and reread the data. The performance impact of this should be negligible if you really are only reading once per minute.
