I am familiar with the concept of InputStreams and buffers and why they are useful (for example, when you need to work with data that might be larger than the machine's RAM).
I was wondering, though: how does the InputStream actually carry all that data? Could an OutOfMemoryError be caused if there is too much data being transferred?
Example scenario
If I connect from a client to a server, requesting a 100 GB file, the server starts iterating through the bytes of the file with a buffer and writing them back to the client with outputStream.write(byte[]). The client is not ready to read the InputStream right now, for whatever reason. Will the server continue sending the bytes of the file indefinitely? And if so, won't the OutputStream/InputStream grow larger than the RAM of one of these machines?
InputStream and OutputStream implementations do not generally use a lot of memory. In fact, the word "stream" in these types means exactly that they do not need to hold the data, because it is accessed sequentially -- in the same way that a stream can transfer water between a lake and the ocean without holding much water itself.
But "stream" is not the best word to describe this. It's more like a pipe, because when you transfer data from a server to a client, every stage transfers back-pressure from the client that controls the rate at which data gets sent. This is similar to how your faucet controls the rate of flow through your pipes all the way to the city reservoir:
As the client reads data, its InputStream only requests more data from the OS when its internal (small) buffers are empty. Each request allows only a limited amount of data to be transferred;
As data is requested from the OS, the OS's own internal buffer empties, and the OS notifies the server how much space is available for new data. The server can send only that much (this is called 'flow control' in TCP: https://en.wikipedia.org/wiki/Transmission_Control_Protocol#Resource_usage);
On the server side, the server-side OS sends data out of its own internal buffer when the client has space to receive it. As that buffer empties, it allows the writing process to refill it with more data;
As the server-side process write()s to its OutputStream, the OutputStream tries to hand the data to the OS. When the OS buffer is full, the write makes the server process wait until the server-side buffer has space to accept new data.
Notice that a slow client can make the server process take a very long time to finish. If you're writing a server and you don't control the clients, it's very important to take this into account and to ensure that a long data transfer does not tie up a lot of server-side resources.
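To make the back-pressure concrete, here is a minimal sketch of the copy loop most servers use; the 8 KiB buffer size is an arbitrary illustrative choice:

    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;

    public final class StreamCopy {
        // Copies a stream in small chunks. Only buffer.length bytes are held
        // in user-space memory at a time; if the receiver is slow, write()
        // simply blocks until the OS-level send buffer drains -- that is the
        // back-pressure described above.
        public static long copy(InputStream in, OutputStream out) throws IOException {
            byte[] buffer = new byte[8192];
            long total = 0;
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n); // may block on a slow client
                total += n;
            }
            out.flush();
            return total;
        }
    }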
Your question is as interesting as it is difficult to answer properly.
First: InputStream and OutputStream are not a means of storage but a means of access: they specify that the data is accessed in sequential, unidirectional order, not how it is stored. The actual way of storing the data is implementation-dependent.
So, could there be an InputStream that holds the whole amount of data in memory at once? Yes, there could be, though it would be an appalling implementation. The most common and sensible way to implement InputStreams / OutputStreams is to hold just a fixed, small amount of data in a temporary buffer of, for example, 4 KB to 8 KB.
(So far, I suppose you already knew that, but it needed to be said.)
Second: what about connected writing/reading streams between a server and a client? In a common buffered-writing scenario, the server will not write more data than the buffer allows. So if the server starts writing and the client then goes down (for whatever reason), the server will just keep writing until the buffer is full, then mark it ready for reading; until the client peer completes that read, the server won't fill the buffer again. Remember that this kind of read/write is blocking: the client blocks until there is a buffer ready to be read, and the server (or at least the server thread bound to this connection) blocks until the last read is completed.
How long will the server block? Typically, a server should have a safety timeout to ensure that long blocks break the connection, thus releasing the blocked thread. The client should do the same.
The timeouts set for the connection depend on the implementation, and the protocol.
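As a sketch of that idea, a server-side read timeout can be set on the accepted socket with Socket.setSoTimeout(); the 30-second value here is only an example, since the right value depends on the protocol:

    import java.io.IOException;
    import java.net.ServerSocket;
    import java.net.Socket;

    public final class TimeoutExample {
        // Accepts one connection and guards its reads with a timeout, so a
        // stalled peer causes a SocketTimeoutException instead of blocking
        // the thread forever.
        static Socket acceptWithTimeout(ServerSocket serverSocket) throws IOException {
            Socket socket = serverSocket.accept();
            socket.setSoTimeout(30_000); // reads fail after 30 s of silence
            return socket;
        }
    }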
No, it does not need to hold all the data. It just advances forward through the file (usually using buffered data). The stream can discard old buffers as it pleases.
Note that there are a lot of very different InputStream implementations, so the exact behaviour varies a lot.
Related
With reference to the linked Stack Overflow question, it is said that an InputStream can be read multiple times using the mark() and reset() methods provided by InputStream, or by using a PushbackInputStream.
In all these cases the content of the stream is stored in a byte array (i.e., the original content of the file is held in main memory) and reused multiple times.
What happens when the size of the file exceeds the memory size? I think this may pave the way for an OutOfMemoryError.
Is there any better way to read the stream content multiple times without storing the stream content locally (i.e., in main memory)?
Please help me understand this. Thanks in advance.
It depends on the source of the stream.
If it's a local file, you can likely re-open and re-read the stream as many times as you want.
If it's dynamically generated by a process, a remote service, etc., you might not be free to re-generate it. In that case, you need to store it, either in memory or in some more persistent (and slow) storage like a file system or storage service.
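A common middle ground is to spool the one-shot stream to a temporary file once, then re-read the file as many times as needed. A minimal sketch (the file name prefix is arbitrary):

    import java.io.IOException;
    import java.io.InputStream;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.StandardCopyOption;

    public final class ReplayableStream {
        // Copies a non-replayable stream to a temp file so it can be re-read
        // any number of times without holding the whole content in RAM.
        static Path spoolToTempFile(InputStream in) throws IOException {
            Path tmp = Files.createTempFile("stream-", ".tmp");
            Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
            return tmp;
        }
    }

Each replay is then just a fresh Files.newInputStream(tmp).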
Maybe an analogy would help. Suppose your friend is speaking to you at length. You listen carefully without interruption, but when they are done, you realize you didn't understand something they said near the beginning, and want to review that portion.
At this point, there are a few possibilities.
Perhaps your friend was actually reading aloud from a book. You can simply re-read the book.
Or, perhaps you had the foresight to record their monologue. You can replay the recording.
However, since neither you nor your friend has perfect and unlimited recall, simply repeating verbatim what was said ten minutes ago from memory alone is not an option.
An InputStream is like your friend speaking. Neither of you has a good enough memory to remember exactly, word-for-word, what is said. In the same way, neither a process that is generating the data stream nor your program has enough RAM to store, byte-for-byte, the stream. To scale, your program has to rely on its "short-term memory" (RAM), working with just a small portion of the whole stream at any given time, and "taking notes" (writing to a persistent store) as it encounters important points.
If the source of stream is a local file, then it's like your friend reading a book. Either of you can re-read that content easily enough.
If you copy the stream to some persistent storage, that's like recording your friend's speech. You can replay it as often as you like.
Consider a scenario where the browser is uploading a large file, but the server is busy and not able to read that stream for some time. Where is the data stored during that delay?
Because the receiver can't always respond immediately to input, TCP and many other protocols allocate a small buffer to store some data from a sender. But they also have a way to tell the sender to pause when it is sending data too fast; this is flow control. Going back to the analogy, it's like telling your friend to pause a moment while you catch up with your note-taking.
As the browser uploads the file, at first, the buffer will be filled. But if the server can't keep up, the browser will be instructed to pause its upload until there is more room in the buffer. (This generally happens at the OS and TCP level; the client and server applications don't manage this directly.) The upload speed depends on how fast the browser can read the file from disk, how fast the network link is, and how fast the server can process the uploaded data. Even a fast network and client will be limited by the weak link in this chain.
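If you want to give the OS more room to absorb such bursts, the receive buffer can be enlarged before the socket is bound. This is only a hint to the OS, and the 256 KiB figure below is an arbitrary example:

    import java.io.IOException;
    import java.net.InetSocketAddress;
    import java.net.ServerSocket;

    public final class BufferTuning {
        // The OS-level receive buffer absorbs data while the application is
        // busy. Setting it before bind() lets TCP advertise a larger window.
        static ServerSocket listen(int port) throws IOException {
            ServerSocket serverSocket = new ServerSocket();
            serverSocket.setReceiveBufferSize(256 * 1024); // a hint, not a guarantee
            serverSocket.bind(new InetSocketAddress(port));
            return serverSocket;
        }
    }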
This question concerns the design of a Java application for reading and processing large amounts of data from a few dozen UDP sockets, but I think it is relevant for other languages and environments.
I've seen network applications like the one described above have dedicated thread(s) for reading data off the socket buffer as quickly as possible, requeuing it inside the application and then processing it in a separate thread.
Is there anything wrong with leaving the data in the socket buffer until your processing thread is ready to receive the next piece of data? Is there any advantage to reading the data quickly and requeuing inside the application?
If the processing logic is not fast enough, the buffers will fill up. But if the processing logic is too slow to handle the inbound data, it seems like it does not matter where the data is queued. In the case of a sudden spike in inbound data, the socket buffers should be large enough to handle it.
The buffer size for received UDP packets in the network stack is limited. If the buffer is full, some packets will be lost.
If the software handling UDP packets knows that it may need some time before it can process a packet, it makes sense to read the packet as soon as possible, relieving the network stack's buffer, and to implement your own buffer or queue for the packets instead, where they can be held until processing resources are actually available.
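A sketch of that pattern: one thread does nothing but drain the socket into an application-level queue, and a separate (possibly slower) thread takes packets from the queue. The queue capacity of 10,000 is an arbitrary bound:

    import java.io.IOException;
    import java.net.DatagramPacket;
    import java.net.DatagramSocket;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    public final class UdpDrain {
        private final BlockingQueue<byte[]> queue = new LinkedBlockingQueue<>(10_000);

        // Reads datagrams as fast as possible so the kernel buffer never
        // overflows; packets wait in the application queue until the
        // processing thread calls queue.take().
        void drainLoop(DatagramSocket socket) throws IOException, InterruptedException {
            byte[] buf = new byte[65_535]; // maximum UDP payload size
            while (!socket.isClosed()) {
                DatagramPacket packet = new DatagramPacket(buf, buf.length);
                socket.receive(packet); // blocks until a datagram arrives
                byte[] copy = new byte[packet.getLength()];
                System.arraycopy(buf, 0, copy, 0, copy.length);
                queue.put(copy); // blocks if the application queue is full
            }
        }
    }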
What are the side effects of flushing an OutputStreamWriter that is going to a network socket?
I have a program which calls out.flush() after every few bytes. Is there any reason why I should wait until all bytes I need are in the buffer?
Will I get lower transfer rate if I flush too much (more overhead)?
Will this slow down execution of my program (blocking)?
Each time you write to a socket, you add between 5 and 15 microseconds. For buffered output, this cost is incurred when you flush() the data. Note: if you don't have buffered output, it is incurred on every write() and the flush() won't do anything.
Fortunately, the OS expects applications to make more calls than is optimal, so it uses Nagle's algorithm by default to group portions of the written data into larger packets. Note: not only does the OS do this, but some network adapters do it by default too.
In short, don't flush() too often, but unless tens of microseconds add up to something that matters to you, you might not notice the difference. E.g., if you do 100 flushes, you might add a millisecond.
There is no reason to flush unless:
you want the peer to receive the data as soon as possible
you've sent a buffered request and you're now going to read the response.
In other cases it is better to let the buffer, the Nagle algorithm, and the TCP receive window work their magic.
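A minimal sketch of that advice: accumulate the whole request in a buffer and flush exactly once, just before waiting for the reply (the 8 KiB buffer size is illustrative):

    import java.io.BufferedOutputStream;
    import java.io.IOException;
    import java.io.OutputStream;
    import java.net.Socket;

    public final class FlushExample {
        // Buffers the complete request and flushes once -- the "about to
        // read the response" case above where a flush is actually needed.
        static void sendRequest(Socket socket, byte[] request) throws IOException {
            OutputStream out = new BufferedOutputStream(socket.getOutputStream(), 8192);
            out.write(request); // accumulates in the buffer
            out.flush();        // one system call, right before reading the reply
        }
    }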
In the case of data transfer over a network, batching the OutputStreamWriter's flushes does improve performance.
I observed that a single flush of a data packet of about 520 bytes took around 3 milliseconds.
I have a cluster of servers (potentially remote from each other) which all run Tomcat and communicate over HTTP using Apache HttpClient. A large number of these servers are data stores, and one of the servers is a front-facing webserver that serves as an intermediary between the client and the stores. A user should be able to upload a file to the webserver and the webserver will pass that file to a given number of stores.
So, the question: is it possible to take the file part of the upload from the client as an InputStream and write it to multiple POST requests to the stores at the same time? If I were simply writing to local files, the obvious solution would be to read chunks of the InputStream into a byte array buffer and write from the buffer to each of the outputs in turn, but I'm at a loss as to how to convince HttpClient to "share" a stream like this.
And yes, I could simply read the entire InputStream into an object on the webserver and write it out to each store sequentially, but since I could potentially be accepting very large files I'd have to write the data to disk and then read it back for each store server, and the number of disk operations could swiftly become prohibitive. This is an implementation I'd prefer to avoid.
If the stores do not have the network bandwidth to keep up, how would it "share" the stream?
You can split up the incoming file and pass it on to the stores without writing it to disk, but if just one of the stores cannot keep up, you'll have to keep that file data in memory until it can accept it. If it's a big file, or many users, it can potentially take all your memory.
More technically, what I mean is that you can create five threads that send data to the stores as fast as they can while keeping the file data in a shared FIFO structure. When the last thread has accessed and sent a portion, that data can be removed from the data structure, but not before. If one thread is slow, the data structure can grow huge.
The data has to be somewhere, if not memory and not hard drive, then where?
So: keep the incoming data in memory until (if?) you run out of memory (never?), then flush it to the hard drive. Keep trying to empty the data structure by sending its contents to the stores and removing them once sent.
You can rather easily code an ExecutorService to handle the re-transmit of data and cleaning up the data structure, but it won't solve the problem magically. :)
I haven't provided source code, because you don't seem to want this solution. You might need help implementing it later, once you accept that you can't magically pass the data on without some chance of having to buffer it on the hard drive (a worse solution would be to throttle the user uploads to MinimumBandwidth(store1, store2, store3, store4, store5)).
Edit:
I'm not sure you really want an ExecutorService, even though I said that; I would actually create my own custom Threads to handle this. I would use a collection from the java.util.concurrent package, probably a LinkedBlockingQueue that holds byte arrays (not bytes; arrays of bytes). Then I would create a map from Thread to Integer that holds each thread's current index into the data being passed on. When all progress numbers are above, say, 10 (meaning all threads have sent the first 10 chunks), I remove the first 10 byte arrays and subtract 10 from each thread's progress to reset it.
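A slightly simpler variant of the same idea, sketched below, gives each uploader thread its own bounded queue instead of a shared structure with progress counters; a bounded queue blocks the reading thread when any store falls behind, which caps memory use at roughly capacity x chunk size x number of stores. The names and sizes are illustrative:

    import java.io.IOException;
    import java.io.InputStream;
    import java.util.Arrays;
    import java.util.List;
    import java.util.concurrent.BlockingQueue;

    public final class FanOutReader {
        static final byte[] EOF = new byte[0]; // sentinel marking end of stream

        // Reads the upload once and hands each chunk to every store's queue.
        static void fanOut(InputStream in, List<BlockingQueue<byte[]>> queues)
                throws IOException, InterruptedException {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                byte[] chunk = Arrays.copyOf(buf, n);
                for (BlockingQueue<byte[]> q : queues) {
                    q.put(chunk); // blocks when that store's queue is full
                }
            }
            for (BlockingQueue<byte[]> q : queues) {
                q.put(EOF);
            }
        }
    }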
Create your own output stream and attach as many HTTP POST clients to it as you need. When you receive data on your output stream, send it to each of the connected POST clients.
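A sketch of that idea, assuming each POST client exposes its request body as an OutputStream: a custom stream that forwards every write to all attached targets. Note that one slow target stalls the rest, as discussed above:

    import java.io.IOException;
    import java.io.OutputStream;
    import java.util.List;

    public final class TeeOutputStream extends OutputStream {
        private final List<OutputStream> targets;

        public TeeOutputStream(List<OutputStream> targets) {
            this.targets = targets;
        }

        @Override public void write(int b) throws IOException {
            for (OutputStream t : targets) t.write(b);
        }

        @Override public void write(byte[] b, int off, int len) throws IOException {
            for (OutputStream t : targets) t.write(b, off, len);
        }

        @Override public void flush() throws IOException {
            for (OutputStream t : targets) t.flush();
        }

        @Override public void close() throws IOException {
            for (OutputStream t : targets) t.close();
        }
    }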
I have a binary protocol with some unknown amount of initial header data (unknown length until header is fully decoded) followed by a stream of data that should be written to disk for later processing.
I have an implementation that decodes the header and then writes the file data to disk as it comes in from one or more frames in ChannelBuffers (so my handler subclasses FrameDecoder, and builds the message step by step, not waiting for one ChannelBuffer to contain the entire message, and writing the file data to disk with each frame). My concern is whether this is enough, or if using something like ChunkedWriteHandler does more than this and is necessary to handle large uploads.
Is there a more optimal way of handling file data than just writing it directly to disk from the ChannelBuffer with each frame?
It should be enough as long as the throughput is good enough. Otherwise, you might want to buffer the received data (say, in a 32 KiB bounded buffer) so that you don't make system calls too often when the amount of received data is small.
Netty could be even faster if it exposed the transferTo/From operation to a user, but such a feature is not available yet.
You should also think about adding an ExecutionHandler in front of your handler. This will help you avoid getting blocked by the disk I/O. Otherwise you may see slowdowns under heavy disk access.
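A sketch, assuming Netty 3.x (the org.jboss.netty packages): placing an ExecutionHandler before your upload handler moves the handler's disk-writing work off the I/O worker threads. The pool sizes (16 threads, 1 MiB per channel, 16 MiB total) are arbitrary illustrations:

    import org.jboss.netty.channel.ChannelHandler;
    import org.jboss.netty.channel.ChannelPipeline;
    import org.jboss.netty.handler.execution.ExecutionHandler;
    import org.jboss.netty.handler.execution.OrderedMemoryAwareThreadPoolExecutor;

    public final class PipelineSetup {
        private static final ExecutionHandler EXECUTION_HANDLER =
                new ExecutionHandler(
                        new OrderedMemoryAwareThreadPoolExecutor(16, 1 << 20, 16 << 20));

        // uploadHandler is your FrameDecoder subclass; handlers added after
        // the ExecutionHandler run on the pool, not on the I/O threads.
        static void configure(ChannelPipeline pipeline, ChannelHandler uploadHandler) {
            pipeline.addLast("execution", EXECUTION_HANDLER);
            pipeline.addLast("upload", uploadHandler);
        }
    }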