Java stream file to socket as fast as possible

I have a file of bytes mybytes.dat, and I'm writing a TCP Server/Client, where the server sends the contents of mybytes.dat to the client over a socket.
If mybytes.dat is read completely into memory, the application can send data at about 160 MB/s over my local network stack. However, I now want to stream data files that are larger than 1 GB, which shouldn't all be read into memory.
This related solution for sending a file in chunks seems appropriate; however, would it be more efficient to read large chunks of the file into memory (i.e. maybe 1 MB at a time into a buffer) and then write them as smaller chunks (32 KB) to the socket? If this is reasonable, how can one use a BufferedInputStream to read large chunks and then write smaller chunks to an OutputStream? To get started, let me declare at least some variables:
BufferedInputStream blobReader = new BufferedInputStream(new FileInputStream("mybytes.dat"), 1024 * 1024);
OutputStream socketWriter = socket.getOutputStream();
What is the correct way to connect my blobReader to the socketWriter, such that I always maintain enough bytes in memory to ensure the application is not limited by disk reads? Or am I completely offbase with this line of reasoning?
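A minimal sketch of such a copy loop, under the assumptions in the question (the class name ChunkedCopy and the copy helper are my own; in-memory streams stand in for the file and the socket so the example is self-contained — in real use, substitute new FileInputStream("mybytes.dat") and socket.getOutputStream()):

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

public class ChunkedCopy {
    // Reads up to 32 KB at a time from the (buffered) source and writes each
    // chunk to the destination. The BufferedInputStream refills its 1 MB
    // buffer from disk in bulk, so most read() calls are served from memory.
    static long copy(InputStream in, OutputStream out) throws IOException {
        byte[] chunk = new byte[32 * 1024];
        long total = 0;
        int n;
        while ((n = in.read(chunk)) != -1) {
            out.write(chunk, 0, n);
            total += n;
        }
        out.flush();
        return total;
    }

    public static void main(String[] args) throws IOException {
        // Stand-ins for the file and the socket's output stream.
        byte[] data = new byte[100_000];
        InputStream blobReader = new BufferedInputStream(new ByteArrayInputStream(data), 1024 * 1024);
        ByteArrayOutputStream socketWriter = new ByteArrayOutputStream();
        System.out.println(copy(blobReader, socketWriter)); // 100000
    }
}
```

Note that the OS's own socket send buffer already decouples your writes from the network, so the exact chunk size matters less than avoiding one-byte writes.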

Related

How is InputStream managed in memory?

I am familiar with the concept of InputStream, buffers, and why they are useful (for example, when you need to work with data that might be larger than the machine's RAM).
I was wondering, though: how does the InputStream actually carry all that data? Could an OutOfMemoryError be caused if there is TOO much data being transferred?
Case-scenario
If I connect from a client to a server, requesting a 100GB file, the server starts iterating through the bytes of the file with a buffer and writing the bytes back to the client with outputStream.write(byte[]). The client is not ready to read the InputStream right now, for whatever reason. Will the server continue sending the bytes of the file indefinitely? And if so, won't the output stream/input stream become larger than the RAM of one of these machines?
InputStream and OutputStream implementations do not generally use a lot of memory. In fact, the word "Stream" in these types means that it does not need to hold the data, because it is accessed in a sequential manner -- in the same way that a stream can transfer water between a lake and the ocean without holding a lot of water itself.
But "stream" is not the best word to describe this. It's more like a pipe, because when you transfer data from a server to a client, every stage transfers back-pressure from the client that controls the rate at which data gets sent. This is similar to how your faucet controls the rate of flow through your pipes all the way to the city reservoir:
As the client reads data, its InputStream only requests more data from the OS when its internal (small) buffers are empty. Each request allows only a limited amount of data to be transferred;
As data is requested from the OS, its own internal buffer empties, and it notifies the server about how much space there is for new data. The server can send only this much (that's called 'flow control' in TCP: https://en.wikipedia.org/wiki/Transmission_Control_Protocol#Resource_usage)
On the server side, the server-side OS sends out data from its own internal buffer when the client has space to receive it. As its own internal buffer empties, it allows the writing process to re-fill it with more data.
As the server-side process write()s to its OutputStream, the OutputStream will try to write data to the OS. When the OS buffer is full, it will make the server process wait until the server-side buffer has space to accept new data.
Notice that a slow client can make the server process take a very long time. If you're writing a server, and you don't control the clients, then it's very important to consider this and to ensure that there are not a lot of server-side resources tied up while a long data transfer takes place.
Your question is as interesting as it is difficult to answer properly.
First: InputStream and OutputStream are not a storage means, but an access means: They describe that the data shall be accessed in sequential, unidirectional order, but not how it shall be stored. The actual way of storing the data is implementation-dependent.
So, could there be an InputStream that stores the whole amount of data in memory simultaneously? Yes, there could be, though it would be an appalling implementation. The most common and sensible implementation of InputStream / OutputStream stores just a small, fixed amount of data in a temporary buffer of, for example, 4 KB to 8 KB.
(So far, I supposed you already knew that, but it was necessary to tell.)
Second: What about connected writing/reading streams between a server and a client? In a common scenario of buffered writing, the server will not write more data than the buffer allows. So, if the server starts writing, and the client then goes down (for whatever reason), the server will just keep writing until the buffer is full, then mark it as ready for reading, and it won't fill the buffer again until the read is completed by the client peer. Remember: this kind of read/write is blocking: the client blocks until there is a buffer ready to be read, and the server (or, at least, the server thread bound to this connection) blocks until the last read is completed.
How long will the server block? Typically, a server should have a safety timeout to ensure that long blocks break the connection, thus releasing the blocked thread. The client should have one as well.
The timeouts set for the connection depend on the implementation, and the protocol.
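To make the timeout point concrete: in Java, blocking socket reads can be bounded with Socket.setSoTimeout (writes have no direct timeout API). A minimal sketch, using an unconnected socket just to show the call; a real server would configure each accepted connection the same way:

```java
import java.io.IOException;
import java.net.Socket;

public class TimeoutDemo {
    public static void main(String[] args) throws IOException {
        try (Socket socket = new Socket()) {
            // After 30 s with no incoming data, a blocked read() throws
            // SocketTimeoutException instead of hanging forever, letting the
            // thread close the connection and move on.
            socket.setSoTimeout(30_000);
            System.out.println(socket.getSoTimeout()); // 30000
        }
    }
}
```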
No, it does not need to hold all the data. It just advances forward through the file (usually using buffered data). The stream can discard old buffers as it pleases.
Note that there are a lot of very different implementations of InputStream, so the exact behaviour varies a lot.

Java understanding I/O streams

I/O streams in Java are the most misunderstood concept for me in programming.
Suppose, we get input stream from a socket connection:
DataInputStream in = new DataInputStream(clientSocket.getInputStream());
When I get data from a remote server, which of these describes things correctly?
Data is stored in the in variable. When extra data comes from the server, it is appended to in, increasing its size. We can then read data from in this way:
byte[] messageByte = new byte[1000];
boolean end = false;
String messageString = "";
int bytesRead;
while (!end)
{
    bytesRead = in.read(messageByte);
    messageString += new String(messageByte, 0, bytesRead);
    if (messageString.length() == 100)
    {
        end = true;
    }
}
in is only a link to the source of data and doesn't contain data itself. When we call in.read(messageByte), are 1000 bytes copied from the socket into messageByte?
Alternatively, instead of a socket, let's say we have a stream connected to a file on the HDD. When we call in.read(messageByte), do we read 1000 bytes from the HDD?
Which approach is right? I tend to think it's #2, but if so, where is the data stored in the socket case? Does the remote server wait until we read the 1000 bytes, and then send more data? Or is the data from the server stored in some buffer in the operating system?
Data is stored in the in variable.
No.
When extra data comes from the server, it is appended to in, increasing its size. And then we can read data from in that way:
byte[] messageByte = new byte[1000];
boolean end = false;
String messageString = "";
int bytesRead;
while (!end)
{
    bytesRead = in.read(messageByte);
    messageString += new String(messageByte, 0, bytesRead);
    if (messageString.length() == 100)
    {
        end = true;
    }
}
No. See below.
in is only a link to the source of data and doesn't contain data itself.
Correct.
And when we call in.read(messageByte), are 1000 bytes copied from the socket into messageByte?
No. It blocks until:
at least one byte has been transferred, or
end of stream has occurred, or
an exception has been thrown,
whichever occurs first. See the Javadoc.
(Instead of a socket, we can have a stream connected to a file on the HDD, and when we call in.read(messageByte) we read 1000 bytes from the HDD. Yes?)
No. Same as above.
Which approach is right?
Neither of them. The correct way to read from an input stream is to loop until you have all the data you're expecting, or EOS or an exception occurs. You can't rely on read() filling the buffer. If you need that, use DataInputStream.readFully().
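To illustrate the difference: read() may return after filling only part of the array, while readFully() loops internally until the array is completely filled (or throws EOFException). A small sketch, with a ByteArrayInputStream standing in for the socket stream:

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;

public class ReadFullyDemo {
    public static void main(String[] args) throws IOException {
        byte[] wire = new byte[1000]; // stands in for the socket's byte stream
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(wire));

        byte[] messageByte = new byte[1000];
        // in.read(messageByte) would only guarantee that *at least one* byte
        // was read; readFully guarantees the whole array is filled.
        in.readFully(messageByte);
        System.out.println(messageByte.length); // 1000 bytes, all populated
    }
}
```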
I tend to think it's #2
That doesn't make sense. You don't have the choice. (1) and (2) aren't programming paradigms, they are questions about how the stream actually works. The question of how to write the code is distinct from that.
Where is the data stored in the socket case?
Some of it is in the socket receive buffer in the kernel. Most of it hasn't arrived yet. None of it is 'in the socket'.
Does the remote server wait until we read the 1000 bytes, and then send more data?
No. The server sends through its socket send buffer into your socket receive buffer. Your reads and the server's writes are very decoupled from each other.
Or is the data from the server stored in some buffer in the operating system?
Yes, the socket receive buffer.
It depends on the type of stream. Where the data is stored varies from stream to stream. Some have internal storage, some read from other sources, and some do both.
A FileInputStream reads from the file on disk when you request it to. The data is on disk, it's not in the stream.
A socket's InputStream reads from the operating system's buffers. When packets arrive, the operating system automatically reads them and buffers up a small amount of data (say, 64KB). Reading from the stream drains that OS buffer. If the buffer is empty because no packets have arrived, your read call blocks. If you don't drain the buffer fast enough and it fills up, TCP flow control stops the sender from transmitting more data until you free up some space.
A ByteArrayOutputStream has an internal byte[] array. When you write to the stream it stores your writes in that array. In this case the stream does have internal storage.
A BufferedInputStream is tied to another input stream. When you read from a BufferedInputStream it will typically request a big chunk of data from the underlying stream and store it in a buffer. Subsequent read requests you issue are then satisfied with data from the buffer rather than performing additional I/O on the underlying stream. The goal is to minimize the number of individual read requests the underlying stream receives by issuing a smaller number of bulk reads. In this case the stream has a mixed strategy of some internal storage and some external reads.
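The effect can be observed by counting how many reads actually reach the underlying stream. In this sketch (the CountingStream subclass is my own, for illustration), 8192 single-byte reads issued by the caller collapse into a single bulk read underneath:

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class BufferedReadDemo {
    // Counts how many bulk read calls reach the underlying stream.
    static class CountingStream extends ByteArrayInputStream {
        int calls = 0;
        CountingStream(byte[] buf) { super(buf); }
        @Override public synchronized int read(byte[] b, int off, int len) {
            calls++;
            return super.read(b, off, len);
        }
    }

    public static void main(String[] args) throws IOException {
        CountingStream raw = new CountingStream(new byte[8192]);
        InputStream in = new BufferedInputStream(raw, 8192);
        // 8192 one-byte reads from the caller's point of view...
        for (int i = 0; i < 8192; i++) in.read();
        // ...but the buffer satisfied them all with one bulk read underneath.
        System.out.println(raw.calls); // 1
    }
}
```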

What is the maximum limit of data that can be written on SSL socket in one shot in java?

I am writing about 10000 bytes to an SSL socket in one shot, using the OutputStream obtained from it:
OutputStream os = ssl_Socket.getOutputStream();
The data is written successfully, but the data received at the server end is getting corrupted somehow.
But if I use a BufferedOutputStream, everything works fine:
os = new BufferedOutputStream(c._s.getOutputStream(), 8196);
My Question :
Is there any limit on data that can be written on SSL socket in one shot ?
Is there any default buffer size ?
Why did it work with BufferedOutputStream? Since I have to write large chunks of data, I would rather not use a BufferedOutputStream.
Is there any limit on data that can be written on SSL socket in one shot?
There is no limit other than Integer.MAX_VALUE. The SSLSocket's output stream will block until all the data has been sent, including encryption and packaging into the requisite number of underlying SSL records.
Is there any default buffer size?
BufferedOutputStream has a default buffer size of 8192. 8196 is a curious number to use for a buffer size, but you should certainly always use a buffered stream or writer over an SSLSocket's output stream. Otherwise you can get a data explosion of up to 42 times if you write one byte at a time.
Why did it work with BufferedOutputStream? Since I have to write large chunks of data, I don't want to use a BufferedOutputStream.
You don't have to use a BufferedOutputStream, but it doesn't hurt, even if you're writing large chunks of data. The buffer is bypassed when possible.
Your problems are almost certainly at the receiving end.
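The buffering effect is easy to observe with an in-memory stream standing in for the SSL socket (the CountingStream subclass is my own, for illustration). Here, 10000 one-byte writes reach the underlying stream as just two bulk writes; on an unbuffered SSLSocket stream, each write() would instead become its own SSL record:

```java
import java.io.BufferedOutputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;

public class BufferedWriteDemo {
    // Counts the bulk write calls that reach the underlying stream.
    static class CountingStream extends ByteArrayOutputStream {
        int calls = 0;
        @Override public synchronized void write(byte[] b, int off, int len) {
            calls++;
            super.write(b, off, len);
        }
    }

    public static void main(String[] args) throws IOException {
        CountingStream raw = new CountingStream();
        OutputStream os = new BufferedOutputStream(raw, 8192);
        for (int i = 0; i < 10_000; i++) os.write(0); // 10000 one-byte writes
        os.flush(); // pushes the remaining partial buffer through
        System.out.println(raw.calls); // 2 (one full 8192 B buffer + the 1808 B tail)
        System.out.println(raw.size()); // 10000
    }
}
```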
"[TLS] specifies a fixed maximum plaintext fragment length of 2^14 bytes." - which is 16K.
Read about the "max_fragment_length" TLS extension, which can limit the size of each block.
PS: I'm not familiar with the Java SSL library; maybe there is something specific to it.

Where is the data queued with a BufferedReader?

I am reading a large CSV from a web service, like this:
br = new BufferedReader(new InputStreamReader(website.openStream(), "UTF-16"));
I read line by line and write into a database. Writing into the database is the bottleneck of this operation, and I am wondering whether I might "time out" with the web service, i.e. reach the condition where the web service just cuts the connection because I am not reading anything from it.
Or does the BufferedReader just buffer the stream into memory until I read from it?
Yes, it is possible that the web service stream will time out while you are writing to the DB. If the DB is really slow enough that this might happen, then you may need to copy the file locally before pushing it into the DB.
+1 for Brian's answer.
Furthermore, I would recommend you have a look at my csv-db-tools on GitHub. The csv-db-importer module illustrates how to import large CSV files into the database. The code is optimized to insert one row at a time and keep memory free of data buffered from large CSV files.
BufferedReader will, as you have speculated, read the contents of the stream into memory. Any calls to read or readLine will read data from the buffer, not from the original stream, assuming the data is already available in the buffer. The advantage here is that data is read in larger batches, rather than requested from the stream at each invocation of read or readLine.
You will likely only experience a timeout like you describe if you are reading large amounts of data. I had some trouble finding a credible reference, but I have seen several mentions of the default buffer size of BufferedReader being 8192 characters (8 KB). This means that if your stream delivers more than 8 KB of data, the buffer could fill and cause your process to wait on the DB bottleneck before reading more data from the stream.
If you think you need a larger buffer than this, the BufferedReader constructor is overloaded with a second parameter allowing you to specify the buffer size in characters. Keep in mind, though, that unless the stream is small enough to fit entirely in the buffer, you could run into the same problem even with a larger buffer.
br = new BufferedReader(new InputStreamReader(website.openStream(), "UTF-16"), size);
will initialize your BufferedReader with a buffer of size bytes.
EDIT:
After reading Keith's comment, I think he's right here. If you experience timeouts, the smaller buffer will cause you to read from the socket more frequently, hopefully eliminating the issue. If he posts an answer saying that, you should accept it.
BufferedReader just reads chunks into an internal buffer, whose default size is unspecified but has been 8192 chars for many years. It doesn't do anything while you're not calling it.
But I don't think your perceived problem even exists. I don't see how the web service will even know. Write timeouts in TCP are quite difficult to implement. Some platforms have APIs for that, but they aren't supported by Java.
Most likely the web service is just using a blocking mode socket and it will just block in its write if you aren't reading fast enough.

How to read huge data in socket and also write into socketchannel

How does one read very big data using a socket's DataInputStream, if the data is in String format and more than 100,000 characters long?
Also, how does one write such big data using a SocketChannel in Java?
The problem is that your data is arriving in chunks. Either the packet size is limiting it, or maybe DataInputStream has an internal buffer of only 40k; I don't know, but it doesn't matter. Either way, all 1,000,000 bytes will not arrive at once, so you have to rewrite your program to expect that. Read the smaller chunks as you receive them and store them in another byte[1000000] variable, keeping track of your last byte index. Keep looping until you are done reading from the socket. Then you can work with your internal variable.
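A sketch of that accumulation loop (the receive helper and its expected-length parameter are my own naming; a ByteArrayInputStream stands in for the socket so the example runs standalone):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

public class ChunkedReceive {
    // Accumulates chunks from the stream until the expected number of bytes
    // has arrived, however the sender's writes were fragmented in transit.
    static byte[] receive(InputStream in, int expected) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream(expected);
        byte[] chunk = new byte[8192];
        while (out.size() < expected) {
            int n = in.read(chunk, 0, Math.min(chunk.length, expected - out.size()));
            if (n == -1) {
                throw new IOException("stream ended after " + out.size() + " bytes");
            }
            out.write(chunk, 0, n);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] data = new byte[1_000_000]; // stands in for the socket payload
        System.out.println(receive(new ByteArrayInputStream(data), data.length).length);
    }
}
```

In practice this requires knowing the expected length up front, e.g. by sending it first with DataOutputStream.writeInt.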
