Is there a limit on process input stream when using Java?

I am creating a process using the Java runtime on Solaris. I then get the input stream from the process and read from it. I expect (I am not too sure about the process; it is a 3rd-party thing) the process's output to be huge, but it seems to be clipped. Could there be a threshold on the Java side as to how much a process can have in its output stream?
Thanks,
Abdul

There is no limit to the amount of data you can read, provided you read repeatedly. You cannot read more than 2 GB at once, and some stream types might only give you a few KB at a time; e.g. a slow Socket will often give you 1.5 KB or less (based on the MTU of the connection).
If you call int read(byte[]), it is only guaranteed to read at least 1 byte (or return -1 at end of stream). It is a common mistake to assume you will read the full buffer every time. If you need that behaviour, you can use DataInputStream.readFully(byte[]).
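For illustration, here is a minimal sketch of such a read loop, draining a stream completely until end of stream (the helper name and buffer size are my own choices):

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Reads until read() returns -1 (end of stream). Each call may return
// fewer bytes than the buffer can hold, so we must keep looping.
static byte[] readAll(InputStream in) throws IOException {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    byte[] buffer = new byte[8192];
    int n;
    while ((n = in.read(buffer)) != -1) {
        out.write(buffer, 0, n);
    }
    return out.toByteArray();
}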

By "process output stream" do you mean STDOUT? STDERR? Or you have an OutputStream object that you direct to somewhere? (a file?)
If you write to a file, you might see clipped data if you don't close your output stream. As long as you go by the book (outputStream.close() when you are done writing) you are good to go. Notice that there are some underlying limitations, like storage space (obviously) or file system limitations (some limit the maximum file size).
If you write to STDOUT/STDERR, as far as I know you are fine. Notice again that if you write your output to a terminal, or through Eclipse (for example), it might have a buffer and therefore limit your output (but in that case it's most likely the first part of the data that goes missing, not the last part).
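For the file case, a minimal sketch of going "by the book" with try-with-resources, which guarantees close() (and the final flush) even if an exception is thrown mid-write (the file name is just an example):

import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

try (OutputStream out = Files.newOutputStream(Paths.get("result.dat"))) {
    out.write("example output".getBytes());
} // close() runs here automatically and flushes any buffered bytes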

You shouldn't run into limitations on InputStream or OutputStream if they are properly implemented. The most likely place to run into limitations is memory, when allocating objects either from the input or for the output - for example, trying to read a 100GB file into memory before writing it to an output. If you need to load very large objects into memory to or from a stream, make sure to use a 64-bit JVM and allocate as much memory to it as you can; however, testing is the only way to determine the ideal values.
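Where possible, streaming avoids the problem entirely. A hypothetical sketch that copies data of any size while holding only a small chunk in memory (paths are placeholders):

import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

try (InputStream in = Files.newInputStream(Paths.get("huge-input.bin"));
     OutputStream out = Files.newOutputStream(Paths.get("copy.bin"))) {
    in.transferTo(out); // Java 9+; copies in fixed-size chunks internally
}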

Related

Efficiently sharing data between processes in different languages

Context
I am writing a Java program that communicates with a C# program through standard in and standard out. The C# program is started as a child process. It gets "requests" through stdin and sends "responses" through stdout. The requests are very lightweight (a few bytes each), but the responses are large. In a normal run of the program, the responses amount to about 2GB of data.
I am looking for ways to improve performance, and my measurements indicate that writing to stdout is a bottleneck. Here are the numbers from a normal run:
Total time: 195 seconds
Data transferred through stdout: 2026MB
Time spent writing to stdout: 85 seconds
stdout throughput: 23.8 MB/s
By the way, I am writing all the bytes to an in-memory buffer first, and copying them in one go to stdout to make sure I only measure stdout write time.
Question
What is an efficient and elegant way to share data between the C# child process and the Java parent process? It is clear that stdout is not going to be enough.
I have read here and there about sharing memory through memory mapped files, but the Java and .NET APIs give me the impression that I'm looking in the wrong place.
Before you invest more in memory mapped files or named pipes, I would first check whether you actually read and write efficiently. java.lang.Process.getInputStream() uses a BufferedInputStream, so the reader side should be OK. But in your C# program you will most likely use Console.Write. The problem here is that AutoFlush is enabled by default, so every single write explicitly flushes the stream. I wrote my last C# code years ago, so I'm not up to date, but maybe it is possible to set the AutoFlush property of Console.Out to false and flush the stream manually after multiple writes.
If disabling AutoFlush is not possible, the only way to improve performance with Console.Out is to write more text per write call.
Another potential bottleneck may be a shell in between that has to interpret the written data. Ensure that you execute the C# program directly and not through a script or via the command interpreter.
Before you start using memory mapped files, I would first try simply writing to a file. As long as you have enough free memory that is not used by your programs or others, and as long as there are no other programs with frequent disk access, the operating system can hold quite a large amount of written data in the file system cache. If your Java program reads from the file fast enough while your C# program is writing to it, chances are high that only some, or even none, of the data has to be loaded from disk.
As Matthew Watson mentioned in the comments, it is indeed possible and incredibly fast to use a memory mapped file. In fact, the throughput for my program went from 24 MB/s to 180 MB/s. Below is the gist of it.
The following Java code creates the memory mapped file used for communication and opens a buffer we can read from:
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
var path = Paths.get("test.mmap");
var channel = FileChannel.open(path, StandardOpenOption.READ, StandardOpenOption.WRITE, StandardOpenOption.CREATE);
var mappedByteBuffer = channel.map(FileChannel.MapMode.READ_WRITE, 0, 200_000 * 8);
The following C# code opens the memory mapped file and creates a stream that you can use to write bytes to it (note that buffer is the name of the array of bytes to be written):
// This code assumes the file has already been created on the Java side
var file = File.Open("test.mmap", FileMode.Open, FileAccess.ReadWrite, FileShare.ReadWrite);
var memoryMappedFile = MemoryMappedFile.CreateFromFile(file, null, 0, MemoryMappedFileAccess.ReadWrite, HandleInheritability.None, false); // null = no map name, 0 = use the file's current size
var stream = memoryMappedFile.CreateViewStream();
stream.Write(buffer, 0, buffer.Length);
stream.Flush();
Of course, you need to somehow synchronize the Java and the C# side. For the sake of simplicity, I didn't include that in the code above. In my code, I am using standard in and standard out to signal when it is safe to read / write.
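For completeness, a hypothetical sketch of the Java read side, continuing from the mappedByteBuffer created above; it assumes the stdin/stdout signalling has told us that length bytes of response are ready:

byte[] data = new byte[length];   // length comes from the stdout signal (an assumption here)
mappedByteBuffer.position(0);     // rewind to the start of the mapping
mappedByteBuffer.get(data);       // bulk-copy the response out of the shared mapping
mappedByteBuffer.clear();         // reset position/limit for the next response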

Buffered vs Unbuffered. How actually buffer work?

How does a buffer actually optimize the process of reading/writing?
Every time we read a byte, we access the file. I have read that a buffer reduces the number of accesses to the file. The question is: how? In the Buffered section of the picture, when we load bytes from the file into the buffer, we access the file just like in the Unbuffered section, so where is the optimization?
I mean... the buffer must access the file every time it reads a byte, so even if the data in the buffer is read faster, this will not improve performance in the process of reading. What am I missing?
The fundamental misconception is to assume that a file is read byte by byte. Most storage devices, including hard drives and solid-state discs, organize the data in blocks. Likewise, network protocols transfer data in packets rather than single bytes.
This affects how the controller hardware and low-level software (drivers and operating system) work. Often, it is not even possible to transfer a single byte on this level. So, requesting the read of a single byte ends up reading one block and ignoring everything but one byte. Even worse, writing a single byte may imply reading an entire block, changing one byte of it, and writing the block back to the device. For network transfers, sending a packet with a payload of only one byte implies using 99% of the bandwidth for metadata rather than actual payload.
Note that sometimes, an immediate response is needed or a write is required to be definitely completed at some point, e.g. for safety. That’s why unbuffered I/O exists at all. But for most ordinary use cases, you want to transfer a sequence of bytes anyway and it should be transferred in chunks of a size suitable to the underlying hardware.
Note that even when the underlying system injects buffering of its own, or when the hardware truly transfers single bytes, performing 100 operating system calls to transfer one byte each is still significantly slower than performing a single operating system call telling it to transfer 100 bytes at once.
But you should not consider the buffer to be something between the file and your program, as suggested in your picture. You should consider the buffer to be part of your program. Just like you would not consider a String object to be something between your program and a source of characters, but rather a natural way to process such items. E.g. when you use the bulk read method of InputStream (e.g. of a FileInputStream) with a sufficiently large target array, there is no need to wrap the input stream in a BufferedInputStream; it would not improve the performance. You should just stay away from the single byte read method as much as possible.
As another practical example, when you use an InputStreamReader, it will already read the bytes into a buffer (so no additional BufferedInputStream is needed) and the internally used CharsetDecoder will operate on that buffer, writing the resulting characters into a target char buffer. When you use, e.g. Scanner, the pattern matching operations will work on that target char buffer of a charset decoding operation (when the source is an InputStream or ByteChannel). Then, when delivering match results as strings, they will be created by another bulk copy operation from the char buffer. So processing data in chunks is already the norm, not the exception.
This has been incorporated into the NIO design. So, instead of supporting a single byte read method and fixing it by providing a buffering decorator, as the InputStream API does, NIO’s ByteChannel subtypes only offer methods using application managed buffers.
So we could say that buffering is not improving performance; it is the natural way of transferring and processing data. Rather, not buffering degrades performance by requiring a translation from the natural bulk data operations to single-item operations.
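As a hypothetical illustration of that advice, contrast the single-byte read method with a bulk read into an application-managed buffer (file name and buffer size are arbitrary):

import java.io.FileInputStream;
import java.io.InputStream;

// Slow: one method call (and, unbuffered, roughly one system call) per byte.
try (InputStream in = new FileInputStream("data.bin")) {
    int b;
    while ((b = in.read()) != -1) { /* process one byte */ }
}

// Fast: bulk reads into a large target array; no BufferedInputStream
// wrapper is needed here, as it would not improve performance.
try (InputStream in = new FileInputStream("data.bin")) {
    byte[] buffer = new byte[64 * 1024];
    int n;
    while ((n = in.read(buffer)) != -1) { /* process n bytes */ }
}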
As your picture suggests, buffered file contents are kept in memory, whereas an unbuffered file is not read into memory until it is streamed to the program.
A File is only a representation of a pathname. Here is the File Javadoc:
An abstract representation of file and directory pathnames.
Meanwhile, a buffer like ByteBuffer takes content from the file and allocates it in memory, either on the heap or outside it, depending on the buffer type (indirect or direct). From the ByteBuffer Javadoc on direct buffers:
The buffers returned by this method typically have somewhat higher allocation and deallocation costs than non-direct buffers. The contents of direct buffers may reside outside of the normal garbage-collected heap, and so their impact upon the memory footprint of an application might not be obvious. It is therefore recommended that direct buffers be allocated primarily for large, long-lived buffers that are subject to the underlying system's native I/O operations. In general it is best to allocate direct buffers only when they yield a measureable gain in program performance.
Actually it depends on the conditions: if the file is accessed repeatedly, then buffered is the faster solution. But if the file is larger than main memory and is only accessed once, unbuffered seems to be the better solution.
Basically, for reading: if you request 1 byte, the buffer reads 1000 bytes and returns you the first one. For the next 999 single-byte reads it does not touch the file at all but serves them from its internal buffer in RAM. Only after you have read all 1000 bytes does it actually read another 1000 bytes from the actual file.
The same applies to writing, but in reverse: if you write 1 byte it is buffered, and only once you have written 1000 bytes might they be written to the file.
Note that the choice of buffer size changes performance quite a bit; see e.g. https://stackoverflow.com/a/237495/2442804 for further details on respecting file system block size, available RAM, etc.
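A small sketch of that mechanic, assuming a BufferedInputStream with the JDK's default buffer size of 8192 bytes (the 1000-byte figure above was only an illustration):

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;

// Each read() below is a cheap in-memory array access; the underlying
// file is touched only once per 8192-byte refill of the buffer.
try (InputStream in = new BufferedInputStream(new FileInputStream("data.bin"), 8192)) {
    int b;
    while ((b = in.read()) != -1) {
        // work with b
    }
}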

Are there any performance benefits to leaving BufferedReader stream open?

Before I ask my question, I am fully aware that leaving an input stream open can cause a memory leak, and therefore doing so is bad practice.
Consider the following preconditions:
Only a single file is needed to be read
The file in question is a text file which contains rows of data
This file is quite large: 50MB or more
The file is read many, many times during a test run
The reason I am asking is that in my test automation suite, the same file needs to be read over and over again to validate certain data fields.
In its current state, the data reader function opens a BufferedReader, reads and returns the data, and then closes the stream.
However, due to the file size and the number of times the file is read, I don't know if leaving the stream open would be beneficial. If I'm being honest, I don't know if the file size affects the opening of the stream at all.
So in summary, given the above listed preconditions, will leaving open a BufferedReader input stream improve overall performance? And is a memory leak still possible?
If you have enough memory to do this, then you will probably get best performance by reading the entire file into a StringBuilder, turning it into a String, and then repeatedly reading from the String via a StringReader.
However, you may need 6 or more times as many bytes of (free) heap space as the size of the file.
2 x to allow for byte -> char expansion
3 x because of the way that a StringBuilder buffer expands as it grows.
You can save space by holding the file in memory as bytes (not chars), and by reading into a byte[] of exactly the right size. But then you need to repeat the bytes -> chars decoding each time you read from the byte[].
You should benchmark the alternatives if you need ultimate performance.
And look at using Buffer to reduce copying.
Re your idea. Keeping the BufferedReader open and using mark and reset would give you a small speedup compared with closing and reopening. But the larger your file is, the smaller the speedup is in relative terms. For a 50GB file, I suspect that the speedup would be insignificant.
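A hypothetical sketch of the read-once approach suggested above: load the file into a String a single time, then hand out a fresh, cheap reader for each validation pass (class and method names are mine):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.nio.file.Files;
import java.nio.file.Paths;

class CachedFile {
    private final String contents;

    CachedFile(String path) throws IOException {
        contents = Files.readString(Paths.get(path)); // Java 11+; one disk read, ever
    }

    // Each caller gets an independent reader over the in-memory copy;
    // no file handle stays open, so nothing can leak.
    BufferedReader newReader() {
        return new BufferedReader(new StringReader(contents));
    }
}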
Yes, in theory not closing a stream could improve performance, as the object will not trigger garbage collection,
assuming you're not de-referencing the BufferedReader. Also, the underlying resources won't need to be synced. See this similar answer: Performance hit opening and closing filehandler?
However, not closing your BufferedReader will result in a memory leak, and you'll see heap usage increase.
I suggest, as others have in the comments and answers, just reading the file into memory and using that. A 50MB file isn't that much, and the performance of reading from a String already in memory will be much higher than that of re-reading the file.

Is it more effective to buffer an output stream than an input stream in Java?

Being bored earlier today, I started thinking a bit about the relative performance of buffered and unbuffered byte streams in Java. As a simple test, I downloaded a reasonably large text file and wrote a short program to determine the effect that buffered streams have when copying the file. Four tests were performed:
Copying the file using unbuffered input and output byte streams.
Copying the file using a buffered input stream and an unbuffered output stream.
Copying the file using an unbuffered input stream and a buffered output stream.
Copying the file using buffered input and output streams.
Unsurprisingly, using buffered input and output streams is orders of magnitude faster than using unbuffered streams. However, the really interesting thing (to me at least) was the difference in speed between cases 2 and 3. Some sample results are as follows:
Unbuffered input, unbuffered output
Time: 36.602513585
Buffered input, unbuffered output
Time: 26.449306847
Unbuffered input, buffered output
Time: 6.673194184
Buffered input, buffered output
Time: 0.069888689
For those interested, the code is available here at Github. Can anyone shed any light on why the times for cases 2 and 3 are so asymmetric?
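For reference, here is my minimal reconstruction of the copy loop (the actual test code is in the repository linked above); wrapping neither, one, or both streams produces the four cases:

import java.io.*;

long start = System.nanoTime();
try (InputStream in = new BufferedInputStream(new FileInputStream("big.txt"));
     OutputStream out = new BufferedOutputStream(new FileOutputStream("copy.txt"))) {
    int b;
    while ((b = in.read()) != -1) {
        out.write(b); // one byte per call; the wrappers decide the real I/O size
    }
}
System.out.printf("Time: %.9f%n", (System.nanoTime() - start) / 1e9);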
When you read a file, the filesystem and the devices below it do various levels of caching. They almost never read one byte at a time; they read a block. On a subsequent read of the next byte, the block will be in cache, so the read will be much faster.
It stands to reason, then, that if your buffer size is the same as your block size, buffering the input stream doesn't actually gain you all that much (it saves a few system calls, but in terms of actual physical I/O it doesn't save you much).
When you write a file, the filesystem can't cache for you because you haven't given it a backlog of things to write. It could potentially buffer the output for you, but it has to make an educated guess at how often to flush the buffer. By buffering the output yourself, you let the device do much more work at once because you manually build up that backlog.
To your title question: it is more effective to buffer the output. The reason is the way hard disk drives (HDDs) write data to their sectors, especially considering fragmented disks. Reading is much faster because the disk already knows where the data is, whereas when writing it has to determine where the data will fit. Using a buffer, the disk can find a larger contiguous blank space to save the data than it can in the unbuffered manner.
Run another test for giggles: create a new partition on your disk and run your tests reading and writing to that clean slate. To compare apples to apples, format the newly created partition between tests. Please post your numbers after this if you run the tests.
Generally, writing is more tedious for the computer because it cannot cache writes the way it can cache reads. Generally it is much like in real life: reading is faster and easier than writing!

In Java, what is the difference between using a BufferedWriter or writing straight to file? [duplicate]

Possible Duplicate:
In Java, what is the advantage of using BufferedWriter to append to a file?
The site that I am looking at says
"The BufferWriter class is used to write text to a character-output stream, buffering characters so as to provide for the efficient writing of single characters, arrays, and strings."
What makes it more efficient, and why?
BufferedWriter is more efficient because it uses an internal buffer rather than writing character by character, which reduces the number of disk I/O operations. Data is collected in the buffer and written to the file when the buffer is full.
This is why sometimes no data ends up in the file if you didn't call the flush method: the data was collected in the buffer, but the program exited before writing it to the file. Calling flush causes the data to be written to the file even if the buffer is not completely full.
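A minimal sketch of that behaviour (the file name is an example):

import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;

try (BufferedWriter writer = new BufferedWriter(new FileWriter("out.txt"))) {
    writer.write("important line");
    writer.flush();            // force the buffered characters to the file now
    writer.write("more text"); // sits in the buffer...
} // ...until close() flushes the remainder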
Writing character by character to the file is expensive. To reduce that cost, buffers are provided: if you write to a buffer, it waits until some limit is reached and then writes the whole chunk to the disk at once.
A BufferedWriter waits until the buffer (8192 characters by default) is full and writes the whole buffer in one disk operation. Unbuffered, every single write would result in a disk I/O, which is obviously more expensive.
Hard disks have a minimum unit of information storage, so for example if you are writing a single byte, the operating system asks the disk to store a whole unit of storage (I think the minimum is 512 bytes). So you ask to write one byte and the operating system writes much more. If you store 512 bytes with 512 separate calls, you end up doing a lot more I/O (512 disk operations) than if you buffer those 512 bytes and issue a single call (1 disk operation).
As the name suggests, BufferedWriter uses a buffer to reduce the cost of writes. If you are writing to a file, you may know that writing 1 byte and writing 4 KB cost roughly the same: the time required for such a write is dominated by the access time (~8 ms), which is the time the disk needs to rotate and seek to the right sector.
Additionally, aggregating small writes into a bigger one allows you to reduce the overhead on the operating system, achieving better performance.
Most operating systems do have an internal buffer to cache writes. However, those caches try to figure out what the application is doing by analyzing its write patterns. If the application itself is able to do that buffering, and performs a write only when the data is ready, the result (in terms of performance) is better.
