I have a cluster of servers (potentially remote from each other) which all run Tomcat and communicate over HTTP using Apache HttpClient. A large number of these servers are data stores, and one of the servers is a front-facing webserver that serves as an intermediary between the client and the stores. A user should be able to upload a file to the webserver and the webserver will pass that file to a given number of stores.
So, the question: is it possible to take the file part of the upload from the client as an InputStream and write to multiple POST requests to the stores at the same time? If I were simply writing to local files, the obvious solution would simply be to read chunks of the InputStream into a byte array buffer and write from the buffer to each of the outputs in turn, but I'm at a loss as to how to convince HttpClient to "share" a stream like this.
And yes, I could simply read the entire InputStream into an object on the webserver and write it out to each store sequentially, but since I could potentially be accepting very large files I'd have to write the data to disk and then read it back for each store server, and the number of disk operations could swiftly become prohibitive. This is an implementation I'd prefer to avoid.
If the stores do not have the network bandwidth to keep up, how would it "share" the stream?
You can split up the incoming file and pass it on to the stores without writing it to disk, but if just one of the stores cannot keep up, you'll have to keep that file data in memory until it can accept it. With a big file, or many concurrent uploads, that can eat all of your memory.
More concretely: you can create five threads that each send data as fast as they can to their store, while keeping the file data in a shared FIFO structure. Once the last thread has accessed and sent a portion, that portion can be removed from the data structure, but not before. If one store is slow, the data structure can grow huge.
The data has to be somewhere, if not memory and not hard drive, then where?
So, keep the incoming data in memory until (if ever) you start running out of memory, then flush it to the hard drive. Keep trying to drain the data structure by sending its contents to the stores and removing what has been sent.
You can rather easily code an ExecutorService to handle the re-transmit of data and cleaning up the data structure, but it won't solve the problem magically. :)
I haven't provided source code, because you don't seem to want this solution. You might need help implementing it later if you accept that you can't magically pass the data on without some chance of having to buffer it on the hard drive (or, a worse solution, throttle the user uploads to MinimumBandwidth(store1, store2, store3, store4, store5)).
Edit:
I'm not sure you really want an ExecutorService, even though I said that; I would actually create my own custom Threads to handle this. I would use a collection from the concurrent package, probably a LinkedBlockingQueue that holds byte arrays (not single bytes, arrays of bytes). Then I would keep a map from Thread to Integer that records how far each thread has got in passing on the data. When every thread's progress is above, say, 10 (meaning all threads have sent the first 10 chunks), I remove the first 10 byte arrays and subtract 10 from every thread's progress counter to reset it.
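For illustration, here is a rough sketch of that fan-out idea with one simplification: each store thread gets its own bounded queue instead of a single shared structure with per-thread progress counters. The bound means the upload is paced by the slowest store (the throttling trade-off mentioned above); every class and constant name here is made up:

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class FanOutUploader {
    private static final int CHUNK_SIZE = 64 * 1024;
    private static final byte[] EOF = new byte[0];   // poison pill marking end of stream

    public static void fanOut(InputStream upload, List<OutputStream> stores)
            throws IOException, InterruptedException {
        List<BlockingQueue<byte[]>> queues = new ArrayList<>();
        List<Thread> senders = new ArrayList<>();

        for (OutputStream store : stores) {
            BlockingQueue<byte[]> queue = new LinkedBlockingQueue<>(16); // bounded -> back-pressure
            queues.add(queue);
            Thread sender = new Thread(() -> {
                try {
                    while (true) {
                        byte[] chunk = queue.take();
                        if (chunk == EOF) break;
                        store.write(chunk);
                    }
                    store.flush();
                } catch (IOException | InterruptedException e) {
                    // real code: record the failure and keep draining so the producer isn't stuck
                }
            });
            sender.start();
            senders.add(sender);
        }

        byte[] buf = new byte[CHUNK_SIZE];
        int n;
        while ((n = upload.read(buf)) != -1) {
            byte[] chunk = Arrays.copyOf(buf, n);
            for (BlockingQueue<byte[]> q : queues) {
                q.put(chunk);   // blocks when a queue is full, so the slowest store sets the pace
            }
        }
        for (BlockingQueue<byte[]> q : queues) {
            q.put(EOF);
        }
        for (Thread t : senders) {
            t.join();
        }
    }
}
```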
Create your own OutputStream and attach as many HTTP POST clients to it as you need. Whenever data is written to your stream, forward it to each of the connected POST clients.
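A minimal sketch of such a fan-out stream might look like this (illustrative only: it assumes you already have one OutputStream per POST body to write into, and it ignores what to do when a single target fails):

```java
import java.io.IOException;
import java.io.OutputStream;
import java.util.List;

// A "tee" OutputStream that copies every write to several underlying streams.
public class MultiOutputStream extends OutputStream {
    private final List<OutputStream> targets;

    public MultiOutputStream(List<OutputStream> targets) {
        this.targets = targets;
    }

    @Override
    public void write(int b) throws IOException {
        for (OutputStream out : targets) out.write(b);
    }

    @Override
    public void write(byte[] b, int off, int len) throws IOException {
        for (OutputStream out : targets) out.write(b, off, len);
    }

    @Override
    public void flush() throws IOException {
        for (OutputStream out : targets) out.flush();
    }

    @Override
    public void close() throws IOException {
        for (OutputStream out : targets) out.close();
    }
}
```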
Related
I am familiar with the concept of InputStreams, buffers and why they are useful (when you need to work with data that might be larger than the machine's RAM, for example).
I was wondering, though: how does the InputStream actually carry all that data? Could an OutOfMemoryError be caused if there is too much data being transferred?
Case-scenario
If I connect from a client to a server, requesting a 100 GB file, the server starts iterating through the bytes of the file with a buffer and writing them back to the client with outputStream.write(byte[]). Suppose the client is not ready to read the InputStream right now, for whatever reason. Will the server continue sending the bytes of the file indefinitely? And if so, won't the output/input stream grow larger than the RAM of one of these machines?
InputStream and OutputStream implementations do not generally use a lot of memory. In fact, the word "Stream" in these types means that it does not need to hold the data, because it is accessed in a sequential manner -- in the same way that a stream can transfer water between a lake and the ocean without holding a lot of water itself.
But "stream" is not the best word to describe this. It's more like a pipe, because when you transfer data from a server to a client, every stage transfers back-pressure from the client that controls the rate at which data gets sent. This is similar to how your faucet controls the rate of flow through your pipes all the way to the city reservoir:
As the client reads data, its InputStream only requests more data from the OS when its internal (small) buffers are empty. Each request allows only a limited amount of data to be transferred;
As data is requested from the OS, its internal receive buffer empties, and the OS notifies the server about how much space there is for new data. The server can send only this much (that's called 'flow control' in TCP: https://en.wikipedia.org/wiki/Transmission_Control_Protocol#Resource_usage)
On the server side, the OS sends out data from its own send buffer when the client has space to receive it. As that buffer empties, it allows the writing process to refill it with more data.
As the server-side process write()s to its OutputStream, the OutputStream will try to write data to the OS. When the OS buffer is full, it will make the server process wait until the server-side buffer has space to accept new data.
Notice that a slow client can make the server process take a very long time. If you're writing a server, and you don't control the clients, then it's very important to consider this and to ensure that there are not a lot of server-side resources tied up while a long data transfer takes place.
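To make the back-pressure concrete, here is a minimal server-side copy loop. Nothing in it holds more than one small array in memory, because write() simply blocks whenever the OS send buffer is full:

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

public final class StreamCopy {
    public static long copy(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[8 * 1024];   // small, fixed-size buffer
        long total = 0;
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);          // blocks here if the client can't keep up
            total += n;
        }
        out.flush();
        return total;
    }
}
```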
Your question is as interesting as it is difficult to answer properly.
First: InputStream and OutputStream are not a storage means, but an access means: They describe that the data shall be accessed in sequential, unidirectional order, but not how it shall be stored. The actual way of storing the data is implementation-dependent.
So, could there be an InputStream that stores the whole amount of data in memory at once? Yes, there could be, though it would be an appalling implementation. The most common and sensible implementation of InputStreams / OutputStreams is to hold just a small, fixed amount of data in a temporary buffer of 4 KB to 8 KB, for example.
(So far, I supposed you already knew that, but it was necessary to tell.)
Second: what about connected writing / reading streams between a server and a client? In a common buffered-writing scenario, the server will not write more data than the buffer allows. So, if the server starts writing and the client then goes down (for whatever reason), the server will just keep writing until the buffer is full, mark it as ready for reading, and not refill the buffer until the client peer has completed the read. Remember: this kind of read/write is blocking. The client blocks until there is a buffer ready to be read, and the server (or, at least, the server thread bound to this connection) blocks until the last read is completed.
How long will the server block? Typically, a server should have a safety timeout so that long blocks break the connection and release the blocked thread. The client should have the same.
The timeouts set for the connection depend on the implementation, and the protocol.
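As an illustration with a plain java.net.Socket (servlet containers and HTTP clients expose equivalent connect/read timeout settings through their own configuration):

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

public class TimeoutExample {
    public static Socket openWithTimeouts(String host, int port) throws IOException {
        Socket socket = new Socket();
        // Give up if the connection can't be established within 5 seconds.
        socket.connect(new InetSocketAddress(host, port), 5_000);
        // Any read that blocks longer than 30 seconds throws SocketTimeoutException.
        socket.setSoTimeout(30_000);
        return socket;
    }
}
```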
No, it does not need to hold all the data. It just advances forward through the file (usually using buffered data), and the stream can discard old buffers as it pleases.
Note that there are a lot of very different implementations of InputStream, so the exact behaviour varies a lot.
With reference to the Stack Overflow question, it is said that an InputStream can be read multiple times using the mark() and reset() methods it provides, or by using a PushbackInputStream.
In all of these cases the content of the stream is stored in a byte array (i.e., the original content of the file is kept in main memory) and reused multiple times.
What happens when the size of the file exceeds the memory size? I think this may pave the way for an OutOfMemoryError.
Is there a better way to read the stream content multiple times without storing the stream content locally (i.e., in main memory)?
Please help me understand this. Thanks in advance.
It depends on the source of the stream.
If it's a local file, you can likely re-open and re-read the stream as many times as you want.
If it's dynamically generated by a process, a remote service, etc., you might not be free to re-generate it. In that case, you need to store it, either in memory or in some more persistent (and slow) storage like a file system or storage service.
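As a small sketch of the "store it" option: spool the stream to a temporary file once, then open a fresh InputStream over that file each time you need to re-read it. Names here are illustrative:

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class SpoolToDisk {
    // Copies the stream to a temporary file so it can be re-read any number of times
    // without keeping the whole content in main memory.
    public static Path spool(InputStream in) throws IOException {
        Path tmp = Files.createTempFile("stream-", ".tmp");
        Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
        return tmp;
    }
}
```

Each call to Files.newInputStream(tmp) then gives an independent, re-readable view of the same content.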
Maybe an analogy would help. Suppose your friend is speaking to you at length. You listen carefully without interruption, but when they are done, you realize you didn't understand something they said near the beginning, and want to review that portion.
At this point, there are a few possibilities.
Perhaps your friend was actually reading aloud from a book. You can simply re-read the book.
Or, perhaps you had the foresight to record their monologue. You can replay the recording.
However, since neither you nor your friend has perfect and unlimited recall, simply repeating verbatim what was said ten minutes ago from memory alone is not an option.
An InputStream is like your friend speaking. Neither of you has a good enough memory to remember exactly, word-for-word, what is said. In the same way, neither a process that is generating the data stream nor your program has enough RAM to store, byte-for-byte, the stream. To scale, your program has to rely on its "short-term memory" (RAM), working with just a small portion of the whole stream at any given time, and "taking notes" (writing to a persistent store) as it encounters important points.
If the source of stream is a local file, then it's like your friend reading a book. Either of you can re-read that content easily enough.
If you copy the stream to some persistent storage, that's like recording your friend's speech. You can replay it as often as you like.
Consider a scenario where the browser is uploading a large file, but the server is busy and unable to read that stream for some time. Where is that data stored during the delay?
Because the receiver can't always respond immediately to input, TCP and many other protocols allocate a small buffer to store some data from a sender. But they also have a way to tell the sender to wait because it is sending data too fast: flow control. Going back to the analogy, it's like telling your friend to pause a moment while you catch up with your note-taking.
As the browser uploads the file, at first, the buffer will be filled. But if the server can't keep up, the browser will be instructed to pause its upload until there is more room in the buffer. (This generally happens at the OS and TCP level; the client and server applications don't manage this directly.) The upload speed depends on how fast the browser can read the file from disk, how fast the network link is, and how fast the server can process the uploaded data. Even a fast network and client will be limited by the weak link in this chain.
I have a C process which needs to send a lot of C structs (roughly 10,000 per second) to a Java process that needs to put the data into a matching class.
The size of the data that needs to be sent will be around 50-100 bytes per packet.
Latency is a concern as the data needs to be displayed in real time, which is why I am looking for alternatives to my current solution.
Currently I'm doing this with JNI and a POSIX message queue. Is there a better way than using JNI and message queues/pipes? I read somewhere that calling JNI methods often has a lot of overhead. Could this become a problem when a lot of data has to be sent?
Another solution I had in mind was to just write the data to a UNIX socket and parse it in Java.
If you must process the data eventually using Java, then remove as many intermediate steps as possible.
If you can read the data directly into Java ( bypassing JNI and the C code ), then do so. Avoid the JNI, message queue and (presumably) a stage where C receives the data. The queue can't be helping latency either.
If the data starts in a C-friendly form that is unfriendly to Java, then I'd consider switching entirely to C or C++ rather than processing in Java at all.
You can achieve high throughput by avoiding unnecessary copying of memory. Transfer data between C and Java through direct byte buffers. From the Java side, you can then read the data in the byte buffers and copy values into the object fields. For two processes talking to each other, you could use a memory mapped file to do this (you would use a MappedByteBuffer for this).
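Here is a hedged sketch of the Java side of that idea. It assumes the C process writes fixed-size little-endian records (imagined here as a long timestamp followed by a double value) into a file both processes open; coordination between writer and reader (how far has been written, when to re-read) is left out entirely:

```java
import java.io.IOException;
import java.nio.ByteOrder;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MappedReader {
    public static void readRecords(Path sharedFile) throws IOException {
        try (FileChannel channel = FileChannel.open(sharedFile, StandardOpenOption.READ)) {
            MappedByteBuffer buffer =
                    channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
            buffer.order(ByteOrder.LITTLE_ENDIAN);   // match the C side's byte order
            while (buffer.remaining() >= 16) {       // 16 bytes per (invented) record
                long timestamp = buffer.getLong();
                double value = buffer.getDouble();
                // ... copy the fields into the matching Java object ...
            }
        }
    }
}
```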
For something simpler but with a bit more overhead, you could simply use the stdin/stdout of each process to communicate and send data that way. Or, as you suggested, a socket is another option.
With 10,000 structs at 100 bytes each, this will be 1 MB / second of data processed. This really shouldn't be a problem on modern hardware (for one of my projects I have managed easily over 1GB / second between direct buffers and Java objects with primitive + array fields, but this was all in one process between JNI and Java).
You might want to try the simpler approach (stdin/stdout) first and see whether that is fast enough before digging into memory-mapped files.
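For the simpler stdin route, the Java side could look roughly like the following. The record layout is invented for illustration, and note that DataInputStream reads big-endian, so the C side would have to write in network byte order (or you would parse with a little-endian ByteBuffer instead):

```java
import java.io.BufferedInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;

public class StdinReader {
    public static void main(String[] args) throws IOException {
        try (DataInputStream in = new DataInputStream(new BufferedInputStream(System.in))) {
            while (true) {
                long timestamp;
                try {
                    timestamp = in.readLong();      // first field of the (invented) record
                } catch (EOFException eof) {
                    break;                          // the producer closed its end of the pipe
                }
                double value = in.readDouble();     // second field of the (invented) record
                // ... fill the matching Java class with timestamp/value and display it ...
            }
        }
    }
}
```

You would then wire the two processes together with an ordinary shell pipe (producer | java StdinReader).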
This, I think, can be solved using typical IPC methods. I would use pipes over a message queue or sockets: the overhead of a message queue is going to slow the processing, and while pipes are faster than sockets, it is not by much.
For your issue, 10,000 structs at 100 bytes per packet comes to 1 MB/s. A modern multicore processor will handle this without a problem.
I want to stream data over network continuously. The source gives me a byte array that I'd want to store in a data structure which serves as buffer to compensate for any network lags.
What is the most efficient data structure to store the bytes in a queue fashion. Think of it as a pipe where one thread pumps in the data and other one reads and sends it over the network, while the pipe itself is long enough to contain multiple frames of the input data.
Is Queue efficient enough?
A Queue would not be efficient if you put bytes in one at a time. It would eat lots of memory, create GC pressure, and slow things down.
You could make the overhead of Queues reasonable if you put reasonably-sized (say 64kB) byte[]s or ByteBuffers in them. That buffer size could be tunable and changed based on performance experiments or perhaps even be adaptive at runtime.
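A minimal sketch of that chunked-queue approach (names and sizes are illustrative; the bound on the queue is what keeps memory use capped):

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.Arrays;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class ChunkPipe {
    private static final int CHUNK_SIZE = 64 * 1024;
    private static final byte[] EOF = new byte[0];        // end-of-stream marker

    private final BlockingQueue<byte[]> queue = new ArrayBlockingQueue<>(32);

    // Reader thread: fills the pipe with 64 kB chunks.
    public void produce(InputStream source) throws IOException, InterruptedException {
        byte[] buf = new byte[CHUNK_SIZE];
        int n;
        while ((n = source.read(buf)) != -1) {
            queue.put(Arrays.copyOf(buf, n));              // blocks when the pipe is "full"
        }
        queue.put(EOF);
    }

    // Sender thread: drains the pipe to the network.
    public void consume(OutputStream network) throws IOException, InterruptedException {
        while (true) {
            byte[] chunk = queue.take();
            if (chunk == EOF) break;
            network.write(chunk);
        }
        network.flush();
    }
}
```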
TCP already compensates for network lags. If you are using UDP then you will need to handle congestion properly or things will go badly. In practice using TCP or UDP directly creates a lot of extra work and reinvention of wheels.
ZeroMQ (or the pure-Java JeroMQ) is a good library option with an efficient wire protocol (good enough for real-time stock trading platforms). It handles the queueing transparently and gives a lot of options for different client models, including things like PUB/SUB, which would help if you have lots of clients on a broadcast. Within a process, ZeroMQ can manage the queueing of data between producers and consumers. You could even use it to efficiently broadcast the same bytes to workers that do independent things with the same stream (e.g. one doing usage metering and another doing transcoding).
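A hedged JeroMQ sketch of the PUB/SUB model, assuming a recent JeroMQ release that provides ZContext and SocketType; the endpoint and frame size are arbitrary:

```java
import org.zeromq.SocketType;
import org.zeromq.ZContext;
import org.zeromq.ZMQ;

public class BroadcastSketch {
    public static void main(String[] args) throws InterruptedException {
        try (ZContext context = new ZContext()) {
            ZMQ.Socket publisher = context.createSocket(SocketType.PUB);
            publisher.bind("tcp://*:5556");

            ZMQ.Socket subscriber = context.createSocket(SocketType.SUB);
            subscriber.connect("tcp://localhost:5556");
            subscriber.subscribe(new byte[0]);      // empty prefix = receive everything

            Thread.sleep(100);                      // crude wait for the subscription to propagate
                                                    // (the PUB/SUB "slow joiner" issue)

            byte[] frame = new byte[64 * 1024];     // e.g. one chunk of the outgoing stream
            publisher.send(frame);                  // queued and delivered asynchronously
            byte[] received = subscriber.recv();    // blocking receive on the consumer side
        }
    }
}
```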
There are other libraries that may also work. I think Netty handles things like this efficiently for example.
You should look into the Okio library.
I have a binary protocol with some unknown amount of initial header data (unknown length until header is fully decoded) followed by a stream of data that should be written to disk for later processing.
I have an implementation that decodes the header and then writes the file data to disk as it comes in from one or more frames in ChannelBuffers (so my handler subclasses FrameDecoder, and builds the message step by step, not waiting for one ChannelBuffer to contain the entire message, and writing the file data to disk with each frame). My concern is whether this is enough, or if using something like ChunkedWriteHandler does more than this and is necessary to handle large uploads.
Is there a more optimal way of handling file data than just writing it directly to disk from the ChannelBuffer with each frame?
It should be enough as long as the throughput is good enough. Otherwise, you might want to buffer the received data (say, in a 32 KiB bounded buffer) so that you don't make system calls too often when the amount of data received per frame is small.
Netty could be even faster if it exposed the transferTo/From operation to a user, but such a feature is not available yet.
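Framework details aside, the buffering suggestion amounts to something like the following (illustrative and not Netty-specific):

```java
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Batches small frames into ~32 KiB writes instead of one system call per frame.
public class BatchedFileSink implements AutoCloseable {
    private final OutputStream out;

    public BatchedFileSink(Path target) throws IOException {
        this.out = new BufferedOutputStream(Files.newOutputStream(target), 32 * 1024);
    }

    public void onFrame(byte[] frameData) throws IOException {
        out.write(frameData);          // buffered; hits the disk only when enough bytes accumulate
    }

    @Override
    public void close() throws IOException {
        out.close();                   // flushes any remaining bytes
    }
}
```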
You should also think about adding an ExecutionHandler in front of your handler. This will help keep your handler from being blocked by disk I/O; otherwise you may see slowdowns under heavy disk access.