Buffers and bytes? - java

Could someone explain to me the uses of buffers, and perhaps give some simple (documented) examples of a buffer in use? Thanks.
I don't know much about this area of Java programming, so forgive me if I've asked the question wrong. :s

A buffer is a space in memory where data is stored temporarily before it is processed. See the Wikipedia article on data buffers.
Here's a simple Java example of how to use the ByteBuffer class to copy a file.
Update
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

public class ByteBufferCopy
{
    public static void main(String[] args) throws IOException
    {
        // read bytes from a file (args[0]) via an input stream (inFile)
        FileInputStream inFile = new FileInputStream(args[0]);
        // create an output stream (outFile) to write bytes to
        FileOutputStream outFile = new FileOutputStream(args[1]);
        // get the unique channel object of the input file
        FileChannel inChannel = inFile.getChannel();
        // get the unique channel object of the output file
        FileChannel outChannel = outFile.getChannel();
        /* create a new byte buffer, pre-allocating 1 MB of space in memory,
           and keep reading up to 1 MB of data from the file into the buffer
           until the entire file has been read (read() returns -1 at end of file) */
        for (ByteBuffer buffer = ByteBuffer.allocate(1024 * 1024); inChannel.read(buffer) != -1; buffer.clear())
        {
            // flip the buffer: set its limit to the current position and its
            // position back to zero, so the bytes just read can be written out
            buffer.flip();
            // write the data from the buffer into the output channel
            while (buffer.hasRemaining()) outChannel.write(buffer);
        }
        // close the channels (which also closes the underlying streams)
        inChannel.close();
        outChannel.close();
    }
}
Hope that clears things up a little.

By a buffer, people usually mean a block of memory used to temporarily store some data. One primary use for buffers is in I/O operations.
A device like a hard disk is good at quickly reading or writing a block of consecutive bytes on the disk in one go. Reading a large amount of data can be done very quickly if you tell the hard disk "read these 10,000 bytes and put them in memory here". If you wrote a loop and got the bytes one by one, telling the hard disk to fetch one byte each time, it would be very inefficient and slow.
So you create a buffer of 10,000 bytes, tell the hard disk to read all the bytes in one go, and then you process those 10,000 bytes one by one from the buffer in memory.
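As a hedged illustration of that difference, here is a minimal sketch that counts the bytes of a hypothetical file "data.bin" first one byte at a time and then through a 10,000-byte buffer; the file name and buffer size are placeholders:

import java.io.FileInputStream;
import java.io.IOException;

public class BufferedVsUnbuffered {
    public static void main(String[] args) throws IOException {
        // Slow: one read() call per byte.
        long count = 0;
        try (FileInputStream in = new FileInputStream("data.bin")) {
            while (in.read() != -1) {
                count++;
            }
        }
        System.out.println("byte-by-byte: " + count + " bytes");

        // Fast: fill a 10,000-byte buffer per call and process it in memory.
        count = 0;
        byte[] buffer = new byte[10000];
        try (FileInputStream in = new FileInputStream("data.bin")) {
            int n;
            while ((n = in.read(buffer)) != -1) {
                count += n;     // process buffer[0..n-1] here
            }
        }
        System.out.println("buffered: " + count + " bytes");
    }
}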

The Sun Java tutorials section on I/O covers this topic:
http://java.sun.com/docs/books/tutorial/essential/io/index.html

Related

How does BufferedReader read files from S3?

I have a very large file (several GB) in AWS S3, and I only need a small number of lines from the file which satisfy a certain condition. I don't want to load the entire file into memory and then search for and print those few lines; the memory load would be too high. The right way would be to load only the lines that are needed into memory.
As per the AWS documentation, to read from the file:
fullObject = s3Client.getObject(new GetObjectRequest(bucketName, key));
displayTextInputStream(fullObject.getObjectContent());
private static void displayTextInputStream(InputStream input) throws IOException {
    // Read the text input stream one line at a time and display each line.
    BufferedReader reader = new BufferedReader(new InputStreamReader(input));
    String line = null;
    while ((line = reader.readLine()) != null) {
        System.out.println(line);
    }
    System.out.println();
}
Here we are using a BufferedReader. It is not clear to me what is happening underneath here.
Are we making a network call to S3 each time we read a new line, and only keeping the current line in the buffer? Or is the entire file loaded into memory and then read line by line by the BufferedReader? Or is it somewhere in between?
One answer to your question is already given in the documentation you linked:
Your network connection remains open until you read all of the data or close the input stream.
A BufferedReader doesn't know where the data it reads is coming from, because you pass another Reader to it. A BufferedReader creates a buffer of a certain size (e.g. 4096 characters) and fills this buffer by reading from the underlying Reader before it starts handing out data to calls of read() or read(char[] buf).
The Reader you pass to the BufferedReader uses, by the way, another buffer of its own to do the conversion from a byte-based stream to a char-based reader. It works the same way as the BufferedReader: its internal buffer is filled by reading from the InputStream you passed in, which is the InputStream returned by your S3 client.
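If you want explicit control over how much is buffered on the character side, you can pass a size to the BufferedReader constructor. A minimal sketch reusing fullObject from the snippet above; the 8192-character buffer, the UTF-8 charset (java.nio.charset.StandardCharsets), and the "needle" condition are illustrative assumptions, not part of the original code:

BufferedReader reader = new BufferedReader(
        new InputStreamReader(fullObject.getObjectContent(), StandardCharsets.UTF_8), 8192);
String line;
while ((line = reader.readLine()) != null) {
    if (line.contains("needle")) {    // placeholder for your "certain condition"
        System.out.println(line);
    }
}
reader.close();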
What exactly happens within this client when you read data from the stream is implementation dependent. One option is to keep a single network connection open and read from it as you wish; another is to close the connection after a chunk of data has been read and open a new one when you request the next chunk.
The documentation quoted above seems to say that we have the former situation here, so: no, calls to readLine do not each trigger a separate network call.
And to answer your other question: no, a BufferedReader, the InputStreamReader and most likely the InputStream returned by the S3 client do not load the whole document into memory. That would defeat the whole purpose of using streams in the first place; otherwise the S3 client could simply return a byte[][] instead (to get around the limit of 2^31-1 bytes per byte array).
Edit: There is one exception to the last paragraph. If the whole multi-gigabyte document contains no line breaks, a single call to readLine will read all of that data into memory (and most likely cause an OutOfMemoryError). I assumed a "regular" text document while answering your question.
If you know the byte range you need, rather than searching for specific words, you can also use the Range header in S3. This is particularly useful since you are working with a single file of several GB. Specifying Range not only reduces memory use but is also faster, as only the specified part of the file is read; a sketch follows the link below.
See Is there "S3 range read function" that allows to read assigned byte range from AWS-S3 file?
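For example, a hedged sketch of a range read with the v1 AWS SDK used in the question, reusing s3Client, bucketName and key from above; the byte offsets 1000-1999 are placeholders, and you should check which GetObjectRequest methods your SDK version provides:

// Read only bytes 1000..1999 of the object (placeholder range).
GetObjectRequest rangeRequest = new GetObjectRequest(bucketName, key)
        .withRange(1000, 1999);
S3Object partialObject = s3Client.getObject(rangeRequest);
BufferedReader reader = new BufferedReader(
        new InputStreamReader(partialObject.getObjectContent()));
String line;
while ((line = reader.readLine()) != null) {
    System.out.println(line);
}
reader.close();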
Hope this helps.
Sreram
It depends on the size of the lines in your file. readLine() keeps building the string, fetching data from the stream in blocks of your buffer size, until it hits a line termination character. So the memory used will be on the order of your line length plus the buffer length.
Only a single HTTP call is made to the AWS infrastructure, and the data is read into memory in small blocks, of which the size may vary and is not directly under your control.
This is very memory-efficient already, assuming each line in the file is a reasonably small size.
One way to optimize further (for network and compute resources), assuming that your "certain condition" is a simple string match, is to use S3 Select: https://aws.amazon.com/s3/features/#s3-select

Java understanding I/O streams

I/O streams in Java are the most misunderstood concept for me in programming.
Suppose, we get input stream from a socket connection:
DataInputStream in = new DataInputStream(clientSocket.getInputStream());
When I get data from a remote server, which of these describes things correctly?
The data is stored in the in variable. When extra data comes from the server, it is appended to in, increasing its size. Then we can read data from the in variable like this:
byte[] messageByte = new byte[1000];
boolean end = false;
String messageString = "";
while (!end)
{
    int bytesRead = in.read(messageByte);
    messageString += new String(messageByte, 0, bytesRead);
    if (messageString.length() == 100)
    {
        end = true;
    }
}
in is only a link to the source of data and doesn't contain the data itself. When we call in.read(messageByte), are 1000 bytes copied from the socket into messageByte?
Alternatively, instead of a socket, let's say we have a stream connected to a file on the HDD. When we call in.read(messageByte), do we read 1000 bytes from the HDD?
Which description is right? I tend to think it's #2, but if so, where is the data stored in the socket case? Does the remote server wait while we read 1000 bytes and then send more data? Or is data from the server stored in some buffer in the operating system?
The data is stored in the in variable.
No.
When extra data comes from the server, it is appended to in, increasing its size. Then we can read data from the in variable like this:
byte[] messageByte = new byte[1000];
boolean end = false;
String messageString = "";
while (!end)
{
    int bytesRead = in.read(messageByte);
    messageString += new String(messageByte, 0, bytesRead);
    if (messageString.length() == 100)
    {
        end = true;
    }
}
No. See below.
in is only a link to the source of data and doesn't contain the data itself.
Correct.
When we call in.read(messageByte), are 1000 bytes copied from the socket into messageByte?
No. It blocks until:
at least one byte has been transferred, or
end of stream has occurred, or
an exception has been thrown,
whichever occurs first. See the Javadoc.
(Alternatively, instead of a socket, let's say we have a stream connected to a file on the HDD. When we call in.read(messageByte), do we read 1000 bytes from the HDD?)
No. Same as above.
Which description is right?
Neither of them. The correct way to read from an input stream is to loop until you have all the data you're expecting, or EOS or an exception occurs. You can't rely on read() filling the buffer. If you need that, use DataInputStream.readFully().
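For instance, a minimal sketch of both patterns, assuming in is the DataInputStream from the question, that exactly 1000 bytes (a placeholder length) are expected, and that java.io.EOFException is imported:

byte[] messageByte = new byte[1000];

// Option 1: loop manually until the buffer is full or the stream ends.
int offset = 0;
while (offset < messageByte.length) {
    int n = in.read(messageByte, offset, messageByte.length - offset);
    if (n == -1) throw new EOFException("stream ended after " + offset + " bytes");
    offset += n;
}

// Option 2: let DataInputStream do the looping for you (it throws
// EOFException if the stream ends before the buffer is filled).
// in.readFully(messageByte);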
I tend to think it's #2
That doesn't make sense. You don't have the choice. (1) and (2) aren't programming paradigms, they are questions about how the stream actually works. The question of how to write the code is distinct from that.
where is the data stored in the socket case?
Some of it is in the socket receive buffer in the kernel. Most of it hasn't arrived yet. None of it is 'in the socket'.
Does the remote server wait while we read 1000 bytes and then send more data?
No. The server sends through its socket send buffer into your socket receive buffer. Your reads and the server's writes are very decoupled from each other.
Or is data from the server stored in some buffer in the operating system?
Yes, the socket receive buffer.
It depends on the type of stream. Where the data is stored varies from stream to stream. Some have internal storage, some read from other sources, and some do both.
A FileInputStream reads from the file on disk when you request it to. The data is on disk, it's not in the stream.
A socket's InputStream reads from the operating system's buffers. When packets arrive, the operating system automatically reads them and buffers up a small amount of data (say, 64KB). Reading from the stream drains that OS buffer. If the buffer is empty because no packets have arrived, your read call blocks. If you don't drain the buffer fast enough and it fills up, TCP flow control tells the sender to stop sending until you free up some space (UDP datagrams that don't fit are simply dropped).
A ByteArrayOutputStream has an internal byte[] array. When you write to the stream it stores your writes in that array. In this case the stream does have internal storage.
A BufferedInputStream is tied to another input stream. When you read from a BufferedInputStream it will typically request a big chunk of data from the underlying stream and store it in a buffer. Subsequent read requests you issue are then satisfied with data from the buffer rather than performing additional I/O on the underlying stream. The goal is to minimize the number of individual read requests the underlying stream receives by issuing a smaller number of bulk reads. In this case the stream has a mixed strategy of some internal storage and some external reads.
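A minimal sketch of that behaviour, assuming a hypothetical file "data.bin" and an arbitrary 64 KB buffer size:

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

public class BufferedInputStreamDemo {
    public static void main(String[] args) throws IOException {
        try (InputStream in = new BufferedInputStream(
                new FileInputStream("data.bin"), 64 * 1024)) {
            long count = 0;
            // Each single-byte read() is served from the 64 KB internal buffer;
            // the underlying FileInputStream only sees occasional bulk reads.
            while (in.read() != -1) {
                count++;
            }
            System.out.println(count + " bytes read");
        }
    }
}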

Java compress/decompress large files (>1gb)

I have made an Android application that lets the user compress and decompress files, using the package java.util.zip. Everything works: the speed is fine, and files are compressed and decompressed correctly, together with the directories. The only problem is that the application is not able to compress/decompress large files (greater than 1 GB).
I believe the problem is the size of my buffer. In other code that I've seen, the buffer value is 1024, 2048 or 8192, but my buffer value is based on the size of the chosen file (just to make it flexible). But once the user chooses a large file (with a size of more than 8 digits), that's where the error appears. I searched over the net and also here on this site but I can't find an answer. My problem is similar to this:
To Compress a big file in a ZIP with Java
Thanks for the future help! :)
EDIT:
Thanks for the comments and answers. They really helped a lot. I thought the BUFFER in compressing/decompressing in Java meant the size of the file, so in my program I made the buffer size flexible (buffer size = file size). Will someone please explain how the buffer works, so I can understand why it is okay for the buffer to have a fixed value? Also, I'd like to figure out why other people say it is much better if the buffer size is 8K or so. Thanks a lot! :)
If you size the buffer to the size of the file, you will get an OutOfMemoryError whenever the file is too big for the available memory.
Use a normal buffer size and let it do its work: buffering the data in a streaming fashion, one chunk at a time, rather than all in one go.
For explanation, see for example the documentation of BufferedOutputStream:
The class implements a buffered output stream. By setting up such an
output stream, an application can write bytes to the underlying output
stream without necessarily causing a call to the underlying system for
each byte written.
So using a buffer is more efficient than non-buffered writing.
And from the write method:
Ordinarily this method stores bytes from the given array into this
stream's buffer, flushing the buffer to the underlying output stream
as needed. If the requested length is at least as large as this
stream's buffer, however, then this method will flush the buffer and
write the bytes directly to the underlying output stream.
Each write causes the in-memory buffer to fill up, until the buffer is full. When the buffer is full, it is flushed and cleared. If you use a very large buffer, you will cause a large amount of data to be stored in memory before flushing. If your buffer is the same size as the input file, then you are saying you need to read the whole content into memory before flushing it. Using the default buffer size is usually just fine. There will be more physical writes (flushes); you avoid exploding memory.
By allowing you to specify a specific buffer size, the API is letting you choose the right balance between memory consumption and i/o to suit your application. If you tune your application for performance, you might end up tweaking buffer size. But the default size will be reasonable for many situations.
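As a hedged illustration of that streaming approach, here is a minimal sketch that compresses one file into a zip with a fixed 8 KB buffer using java.util.zip, as in the question; the file names are placeholders:

import java.io.*;
import java.util.zip.*;

public class ZipWithFixedBuffer {
    public static void main(String[] args) throws IOException {
        byte[] buffer = new byte[8192];   // fixed size, independent of the file size
        try (FileInputStream in = new FileInputStream("big-input.dat");
             ZipOutputStream zipOut = new ZipOutputStream(
                     new FileOutputStream("output.zip"))) {
            zipOut.putNextEntry(new ZipEntry("big-input.dat"));
            int n;
            while ((n = in.read(buffer)) != -1) {
                zipOut.write(buffer, 0, n);   // only 8 KB held in memory at a time
            }
            zipOut.closeEntry();
        }
    }
}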
It sounds like it would help to simply set a maximum size for the buffer, something like:
//After calculating the buffer size bufSize:
bufSize = Math.min(bufSize, MAXSIZE);

I want to read a big text file

I want to read a big text file. I decided to create four threads and have each one read 25% of the file, and then join them, but it is not noticeably faster.
Can anyone tell me whether I can use concurrent programming for this?
My file structure has data like:
name contact company policyname policynumber uniqueno
and I want to put all the data into a HashMap at the end.
Thanks
Reading a large file is typically limited by I/O performance, not by CPU time. You can't speed up the reading by dividing into multiple threads (it will rather decrease performance, since it's still the same file, on the same drive). You can use concurrent programming to process the data, but that can only improve performance after reading the file.
You may, however, have some luck by dedicating one single thread to reading the file and delegating the actual processing to worker threads whenever a data unit has been read; a sketch of this layout follows.
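A hedged sketch of that single-reader / multiple-worker layout; the pool size, the whitespace splitting, and keying the map on the unique number are assumptions based on the file format described in the question:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class SingleReaderMultipleWorkers {
    public static void main(String[] args) throws IOException, InterruptedException {
        Map<String, String> policies = new ConcurrentHashMap<>();
        ExecutorService workers = Executors.newFixedThreadPool(4);

        // One thread reads sequentially; parsing is handed off to the pool.
        try (BufferedReader reader = new BufferedReader(new FileReader(args[0]), 1 << 16)) {
            String line;
            while ((line = reader.readLine()) != null) {
                final String current = line;
                workers.submit(() -> {
                    String[] fields = current.split("\\s+");
                    if (fields.length >= 6) {
                        // key on the unique number, keep the whole record as the value
                        policies.put(fields[5], current);
                    }
                });
            }
        }
        workers.shutdown();
        workers.awaitTermination(1, TimeUnit.HOURS);
        System.out.println(policies.size() + " records loaded");
    }
}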
If it is a big file, chances are it is written to disk as one contiguous piece, and "streaming" the data will be faster than parallel reads, which would move the disk heads back and forth. To know what is fastest you need intimate knowledge of your target production environment, because on high-end storage the data will likely be distributed over multiple disks and parallel reads might be faster.
The best approach, I think, is to read it in large chunks into memory and make it available as a ByteArrayInputStream for parsing.
Quite likely you will peg the CPU during parsing and handling of the data. Maybe a parallel map-reduce approach could help spread that load over all cores.
You might want to use Memory-mapped file buffers (NIO) instead of plain java.io.
Well, you might thrash the disk cache and put high contention on the synchronization of the HashMap if you do it like that. I would suggest that you simply make sure you have buffered the stream properly (possibly with a large buffer size). Use the BufferedReader(Reader in, int sz) constructor to specify the buffer size.
If the bottleneck is not parsing the lines (that is, the bottleneck is not the CPU usage), you should not parallelize the task in the way described.
You could also look into memory-mapped files (available through the nio package), but that's probably only useful if you want to read and write files efficiently. A tutorial with source code is available here: http://www.linuxtopia.org/online_books/programming_books/thinking_in_java/TIJ314_029.htm
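As a hedged illustration of the memory-mapped option, a minimal sketch that maps a file and counts its lines; it assumes the file fits in a single mapping (roughly under 2 GB), and whether this beats a well-buffered stream depends on your environment:

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MappedFileRead {
    public static void main(String[] args) throws IOException {
        try (RandomAccessFile file = new RandomAccessFile(args[0], "r");
             FileChannel channel = file.getChannel()) {
            // Map the whole file into memory; a single mapping is limited to ~2 GB.
            MappedByteBuffer map = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
            long newlines = 0;
            while (map.hasRemaining()) {
                if (map.get() == '\n') {
                    newlines++;
                }
            }
            System.out.println(newlines + " lines (by newline count)");
        }
    }
}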
Well, you can take help from the link below:
http://java.sun.com/developer/technicalArticles/Programming/PerfTuning/
Or use a large buffer, or something like this:
import java.io.*;

public class line1 {
    public static void main(String[] args) {
        if (args.length != 1) {
            System.err.println("missing filename");
            System.exit(1);
        }
        try {
            FileInputStream fis = new FileInputStream(args[0]);
            // Buffer the raw byte stream and read it line by line.
            // (BufferedReader replaces the deprecated DataInputStream.readLine().)
            BufferedReader reader = new BufferedReader(
                    new InputStreamReader(fis));
            int cnt = 0;
            while (reader.readLine() != null)
                cnt++;
            reader.close();
            // Print the number of lines in the file.
            System.out.println(cnt);
        }
        catch (IOException e) {
            System.err.println(e);
        }
    }
}

How to limit the maximum size read via ObjectInputStream from a Socket?

Is there a way to limit the maximum buffer size to be read from an ObjectInputStream in java?
I want to stop the deserialization if it becomes clear that the Object in question is crafted maliciously huge.
Of course, there is ObjectInputStream.read(byte[] buf, int off, int len), but I do not want to suffer the performance penalty of allocating, say byte[1000000].
Am I missing something here?
You write a FilterInputStream which will throw an exception if it discovers it has read more than a certain amount of data from its underlying stream.
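A hedged sketch of such a limiting FilterInputStream; the class name, limit and exception message are illustrative, not a standard JDK class:

import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

/** Throws if more than maxBytes are read from the underlying stream. */
public class LimitedInputStream extends FilterInputStream {
    private final long maxBytes;
    private long bytesRead;

    public LimitedInputStream(InputStream in, long maxBytes) {
        super(in);
        this.maxBytes = maxBytes;
    }

    @Override
    public int read() throws IOException {
        int b = super.read();
        if (b != -1) {
            count(1);
        }
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        if (n > 0) {
            count(n);
        }
        return n;
    }

    private void count(long n) throws IOException {
        bytesRead += n;
        if (bytesRead > maxBytes) {
            throw new IOException("read limit of " + maxBytes + " bytes exceeded");
        }
    }
}

You would then construct the stream as, for example, new ObjectInputStream(new LimitedInputStream(socket.getInputStream(), 1000000)).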
I can see two ways:
1) do your reads in a loop, grabbing chunks whose allocation size you're comfortable with, and exit and stop when you hit your limit; or
2) Allocate your max-size buffer once and re-use it for subsequent reads.
Actually, there's a really easy way.
You can use NIO's ByteBuffer with the allocateDirect method. This allocates a direct buffer outside the normal Java heap, so it doesn't put pressure on the garbage-collected heap, and you can limit its size.
Then, instead of getting the stream from the socket, get the Channel.
Code:
Socket s;   // assumed to come from a SocketChannel, otherwise getChannel() returns null
ByteBuffer buffer = ByteBuffer.allocateDirect(10 * 1024 * 1024);
s.getChannel().read(buffer);
Now, don't try to call the "array()" method on the byte buffer; it doesn't work on a directly-allocated buffer. However, you can wrap the buffer as an input stream and send it to the ObjectInputStream for further processing.
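The JDK has no built-in ByteBuffer-to-InputStream adapter, so that wrapping step needs a small helper; a hedged sketch (remember to flip() the buffer after reading from the channel so its contents become readable):

import java.io.InputStream;
import java.nio.ByteBuffer;

/** Presents the readable contents of a ByteBuffer as an InputStream. */
public class ByteBufferInputStream extends InputStream {
    private final ByteBuffer buffer;

    public ByteBufferInputStream(ByteBuffer buffer) {
        this.buffer = buffer;
    }

    @Override
    public int read() {
        return buffer.hasRemaining() ? (buffer.get() & 0xFF) : -1;
    }

    @Override
    public int read(byte[] dest, int off, int len) {
        if (!buffer.hasRemaining()) {
            return -1;
        }
        int n = Math.min(len, buffer.remaining());
        buffer.get(dest, off, n);
        return n;
    }
}

Then something like new ObjectInputStream(new ByteBufferInputStream(buffer)) can deserialize from the size-limited buffer.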
