I wanted to pipe an OutputStream to an InputStream such that every time I write to my OutputStream those bytes become available in my InputStream.
Reading the JDK documentation, I found PipedInputStream and PipedOutputStream, which seemed a good alternative, e.g.
PipedOutputStream out = new PipedOutputStream();
PipedInputStream in = new PipedInputStream(out);
However, the documentation explicitly states the two streams must be used in separate threads.
Typically, data is read from a PipedInputStream object by one thread and data is written to the corresponding PipedOutputStream by some other thread. Attempting to use both objects from a single thread is not recommended, as it may deadlock the thread.
Is there another easy way to pipe these two streams to run in the same thread using some other features available in Java?
I suppose the major issue here is buffering the data being written to the output, since while we're writing, those bytes get buffered somewhere to be consumed later by the reading side.
However I'm working with discrete amounts of data, of the kind that would easily fit in memory. So for me buffering a few bytes is not a big concern. I'm more interested in finding a simple pattern to do this.
That being the case, I thought I could easily do this manually by writing everything to a ByteArrayOutputStream, then getting the bytes from it and reading them again through a ByteArrayInputStream.
However this piping scenario seems such a natural use case that I was wondering if there's another simpler way to pipe two streams in a single-threaded application, e.g.
output.pipe(input);
message.writeTo(output);
process(input);
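The manual approach described above can be sketched as follows. This is only an illustration under the question's own assumption that the data fits in memory; the `Writable` interface and the `pipe` helper are hypothetical names standing in for `message.writeTo(output)`, and `readAllBytes()` requires Java 9 or later.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

final class SingleThreadPipe {

    /** Hypothetical stand-in for anything that can write itself to a stream. */
    interface Writable {
        void writeTo(OutputStream out) throws IOException;
    }

    /** Runs the producer to completion, then exposes what it wrote as an InputStream. */
    static InputStream pipe(Writable producer) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        producer.writeTo(buffer);                                // everything is held in memory
        return new ByteArrayInputStream(buffer.toByteArray());   // toByteArray() makes a copy
    }

    public static void main(String[] args) throws IOException {
        InputStream in = pipe(out -> out.write("hello".getBytes(StandardCharsets.UTF_8)));
        System.out.println(new String(in.readAllBytes(), StandardCharsets.UTF_8));
    }
}
```

Because the producer finishes before the consumer starts, there is nothing to deadlock on, at the cost of holding the whole payload (twice, briefly) in memory.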
I want to understand the conceptual difference, as I have been seeing classes such as SocketChannel and FileChannel alongside Socket and file I/O streams.
As far as I know, I/O streams must be accessed sequentially, i.e. they are a sequence of bytes that can be read from and written to. You can also use buffered streams to increase I/O efficiency.
So, compared to Streams, are "Channels" a totally new concept or just a wrapper over Streams?
Yes, if we say "a stream is a sequence of bytes", then what is a channel in that sense, if the two are different?
Neither. Channels are not wrappers around streams (unless you explicitly wrap a stream via Channels.newChannel(InputStream) or Channels.newChannel(OutputStream)) and they are not a “totally new concept”.
Depending on the particular type, a channel still represents a sequence of bytes that may be read or written sequentially. The fact that you can translate between these APIs via factory methods in the Channels class shows that there is a relationship.
But the NIO API addresses certain design issues which could not be fixed by refactoring the old stream classes (in a compatible way). E.g. the base types are interfaces now, which allows certain channels to implement multiple types, like ReadableByteChannel and WritableByteChannel at the same time. Further, there is no method for reading a single byte, which is a good way to get rid of the “You can use BufferedStream to increase efficiency” myth. If an insufficient buffer size is the cause of an I/O performance bottleneck, you solve it by providing a larger buffer in the first place, rather than wrapping a stream or channel into another, forcing it to copy all data between buffers. Consequently, there is no BufferedChannel.
Certain implementations like FileChannel offer additional methods allowing random access to the underlying resource, in addition to the sequential access. That way, you can use a uniform interface, instead of dealing with entirely different APIs, like with the RandomAccessFile/ InputStream/ OutputStream relationship.
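As a rough sketch of that point, the example below uses one FileChannel for both a sequential write and a positional read. The class and method names are made up for illustration; the `FileChannel.open`, `write(ByteBuffer)` and `read(ByteBuffer, long)` calls are the standard NIO ones.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class FileChannelAccess {

    /** Writes text sequentially, then reads up to `length` bytes starting at `offset`. */
    static String writeThenReadAt(Path file, String text, long offset, int length) throws IOException {
        try (FileChannel ch = FileChannel.open(file,
                StandardOpenOption.CREATE, StandardOpenOption.TRUNCATE_EXISTING,
                StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            ByteBuffer src = ByteBuffer.wrap(text.getBytes(StandardCharsets.US_ASCII));
            while (src.hasRemaining()) {
                ch.write(src);                  // sequential write, advances the channel position
            }
            ByteBuffer dst = ByteBuffer.allocate(length);
            long pos = offset;
            while (dst.hasRemaining()) {
                int n = ch.read(dst, pos);      // positional read, leaves the channel position alone
                if (n < 0) {
                    break;                      // reached end of file
                }
                pos += n;
            }
            dst.flip();
            return StandardCharsets.US_ASCII.decode(dst).toString();
        }
    }
}
```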
Further, a lot of previously missing I/O features were added when NIO was introduced. Some of them could have been implemented atop the old classes without problems, but the designers clearly favored using the new API for all of them, where these features could be considered in the design right from the start.
But generally, as said, a channel isn’t an entirely new concept compared to streams.
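To illustrate the relationship between the two APIs, here is a small sketch that wraps an ordinary InputStream with `Channels.newChannel` and drains it through a caller-sized ByteBuffer. The `StreamToChannel` class and `readAll` method are hypothetical names; note how the buffer size is chosen up front rather than by stacking a buffering wrapper.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.channels.Channels;
import java.nio.channels.ReadableByteChannel;

final class StreamToChannel {

    /** Reads the whole stream through a channel, using a heap buffer of the given size. */
    static byte[] readAll(InputStream in, int bufferSize) throws IOException {
        ByteArrayOutputStream result = new ByteArrayOutputStream();
        try (ReadableByteChannel ch = Channels.newChannel(in)) {
            ByteBuffer buf = ByteBuffer.allocate(bufferSize);   // bufferSize must be > 0
            while (ch.read(buf) != -1) {
                buf.flip();                                     // switch from filling to draining
                result.write(buf.array(), 0, buf.limit());
                buf.clear();                                    // ready for the next read
            }
        }
        return result.toByteArray();
    }
}
```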
I'm currently using Java sockets in a client-server application with OutputStream and not BufferedOutputStream (and the same for input streams).
The client and server exchange serialized objects (writeObject() method).
Does it make sense (more speed) to use BufferedOutputStream and BufferedInputStream in this case?
And when do I have to flush, or should I not write a flush() statement at all?
Does it make sense (more speed) to use BufferedOutputStream and BufferedInputStream in this case?
Actually, it probably doesn't make sense [1].
The object stream implementation internally wraps the stream it has been given with a private class called BlockDataOutputStream that does buffering. If you wrap the stream yourself, you will have two levels of buffering ... which is likely to make performance worse [2].
And when do I have to flush, or should I not write a flush() statement at all?
Yes, flushing is probably necessary. But there is no universal answer as to when to do it.
On the one hand, if you flush too often, you generate extra network traffic.
On the other hand, if you don't flush when it is needed, the server can stall waiting for an object that the client has written but not flushed.
You need to find the compromise between these two syndromes ... and that depends on your application's client/server interaction patterns; e.g. whether the message patterns are synchronous (e.g. message/response) or asynchronous (e.g. message streaming).
[1] To be certain on this, you would need to do some forensic testing to 1) measure the system performance, and 2) determine what syscalls are made and when network packets are sent. For a general answer, you would need to repeat this for a number of use-cases. I'd also recommend looking at the Java library code yourself to confirm my (brief) reading.
[2] Probably only a little bit worse, but a well designed benchmark would pick up a small performance difference.
UPDATE
After writing the above, I found this Q&A - Performance issue using Javas Object streams with Sockets - which seems to suggest that using BufferedInputStream / BufferedOutputStream helps. However, I'm not certain whether the performance improvement that was reported is 1) real (i.e. not a warmup artefact) and 2) due to the buffering. It could be just due to adding the flush() call. (Why: because the flush could cause the network stack to push the data sooner.)
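For a synchronous message/response pattern, one plausible compromise is to flush once per complete message. The sketch below assumes that pattern; `send` and `roundTrip` are hypothetical helpers, and a ByteArrayOutputStream stands in for the socket's output stream so the example runs without a network.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

final class ObjectStreamFlush {

    /** Writes one message and flushes, so the peer is not left waiting on buffered bytes. */
    static void send(ObjectOutputStream out, Serializable message) throws IOException {
        out.writeObject(message);
        out.flush();   // one flush per complete message, not per field or per byte
    }

    /** Sends a message through an in-memory "wire" and reads it back. */
    static Object roundTrip(Serializable message) throws IOException, ClassNotFoundException {
        ByteArrayOutputStream wire = new ByteArrayOutputStream();   // stand-in for socket.getOutputStream()
        try (ObjectOutputStream out = new ObjectOutputStream(wire)) {
            out.flush();                                            // push the stream header out early
            send(out, message);
        }
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(wire.toByteArray()))) {
            return in.readObject();
        }
    }
}
```

For asynchronous streaming of many small messages, flushing less often (for example once per batch) may be the better trade-off; measuring is the only way to know.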
I think these links might help you:
What is the purpose of flush() in Java streams?
The flush method flushes the output stream and forces any buffered output bytes to be written out. The general contract of flush is that calling it is an indication that, if any bytes previously written have been buffered by the implementation of the output stream, such bytes should immediately be written to their intended destination.
How java.io.Buffer* stream differs from normal streams?
Internally a buffer array is used and instead of reading bytes individually from the underlying input stream enough bytes are read to fill the buffer. This generally results in faster performance as less reads are required on the underlying input stream.
http://www.oracle.com/technetwork/articles/javase/perftuning-137844.html
As a means of starting the discussion, here are some basic rules on how to speed up I/O: 1.Avoid accessing the disk. 2.Avoid accessing the underlying operating system. 3.Avoid method calls. 4.Avoid processing bytes and characters individually.
So using buffered streams usually speeds up I/O processing, as fewer read() calls reach the underlying stream.
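The effect described in these quotes can be made visible with a small counting wrapper. This is a sketch, not a benchmark: `CountingInputStream` and `countUnderlyingReads` are made-up names, and the source is an in-memory array, so it only shows how many calls reach the underlying stream when the caller reads one byte at a time.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

final class ReadCallCounter {

    /** Counts how often the wrapped stream is asked for data. */
    static final class CountingInputStream extends FilterInputStream {
        int calls;

        CountingInputStream(InputStream in) {
            super(in);
        }

        @Override
        public int read() throws IOException {
            calls++;
            return super.read();
        }

        @Override
        public int read(byte[] b, int off, int len) throws IOException {
            calls++;
            return super.read(b, off, len);
        }
    }

    /** Reads totalBytes one byte at a time and reports the calls seen by the source. */
    static int countUnderlyingReads(int totalBytes, boolean buffered) throws IOException {
        CountingInputStream source =
                new CountingInputStream(new ByteArrayInputStream(new byte[totalBytes]));
        InputStream in = buffered ? new BufferedInputStream(source) : source;
        while (in.read() != -1) {
            // consume byte by byte, the worst case for an unbuffered stream
        }
        return source.calls;
    }
}
```

With a real file or socket, each of those underlying calls would typically be a system call, which is where the cost comes from.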
I am reading a large CSV from a web service like this:
br = new BufferedReader(new InputStreamReader(website.openStream(), "UTF-16"));
I read line by line and write into a database. The writing into a database is the bottleneck of this operation and I am wondering if it is possible that I will "timeout" with the webservice so I get the condition where the webservice just cuts the connection because I am not reading anything from it...
Or does the BufferedReader just buffer the stream into memory until I read from it?
Yes, it is possible that the web service stream will time out while you are writing to the DB. If the DB is slow enough that this might happen, then you may need to copy the file locally before pushing it into the DB.
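A minimal sketch of that "copy locally first" idea follows. The `DownloadThenImport` class, the `importLines` method and the `rowHandler` callback (standing in for the database insert) are assumptions made for illustration; in the question, `remote` would be `website.openStream()`.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.function.Consumer;

final class DownloadThenImport {

    /** Spools the remote stream to a temp file, then feeds it line by line to rowHandler. */
    static long importLines(InputStream remote, Charset charset, Consumer<String> rowHandler)
            throws IOException {
        Path local = Files.createTempFile("download-", ".csv");
        try {
            // Drain the connection as fast as the network allows, independent of the DB.
            Files.copy(remote, local, StandardCopyOption.REPLACE_EXISTING);
            long rows = 0;
            try (BufferedReader br = Files.newBufferedReader(local, charset)) {
                String line;
                while ((line = br.readLine()) != null) {
                    rowHandler.accept(line);   // the slow database write would go here
                    rows++;
                }
            }
            return rows;
        } finally {
            Files.deleteIfExists(local);
        }
    }
}
```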
+1 for Brian's answer.
Furthermore, I would recommend you have a look at my csv-db-tools on GitHub. The csv-db-importer module illustrates how to import large CSV files into the database. The code is optimized to insert one row at a time and keep the memory free from data buffered from large CSV files.
BufferedReader will, as you have speculated, read the contents of the stream into memory. Any calls to read or readLine will read data from the buffer, not from the original stream, assuming the data is already available in the buffer. The advantage here is that data is read in larger batches, rather than requested from the stream at each invocation of read or readLine.
You will likely only experience a timeout like you describe if you are reading large amounts of data. I had some trouble finding a credible reference, but I have seen several mentions of the default buffer size of BufferedReader being 8192 chars. This means that if your stream delivers more than 8192 chars of data, the buffer could fill up and cause your process to wait on the DB bottleneck before reading more data from the stream.
If you think you need to reserve a larger buffer than this, the BufferedReader constructor is overloaded with a second parameter allowing you to specify the size of the buffer in chars. Keep in mind, though, that unless you are moving pieces of data small enough to buffer the entire stream, you could run into the same problem even with a larger buffer.
br = new BufferedReader(new InputStreamReader(website.openStream(), "UTF-16"), size);
will initialize your BufferedReader with a buffer of size chars.
EDIT:
After reading @Keith's comment, I think he's got the right of it here. If you experience timeouts, the smaller buffer will cause you to read from the socket more frequently, hopefully eliminating that issue. If he posts an answer with that, you should accept his.
BufferedReader just reads in chunks into an internal buffer, whose default size is unspecified but has been 8192 chars for many years. It doesn't do anything while you're not calling it.
But I don't think your perceived problem even exists. I don't see how the web service will even know. Write timeouts in TCP are quite difficult to implement. Some platforms have APIs for that, but they aren't supported by Java.
Most likely the web service is just using a blocking mode socket and it will just block in its write if you aren't reading fast enough.
I am using a file as a cache for big data. One thread writes to it sequentially, another thread reads it sequentially.
Can I be sure that all data that has been written (by write()) in one thread can be read() from another thread, assuming a proper "happens-before" relationship in terms of the Java memory model? Is this behavior documented?
In my JDK, FileOutputStream does not override flush(), and OutputStream.flush() is empty. That's why I'm wondering...
The streams in question are owned exclusively by a class that I have full control of. Each stream is guaranteed to be accessed by one thread only. My tests show that it works as expected, but I'm still wondering if this is guaranteed and documented.
See also this related discussion.
Assuming you are using a posix file system, then yes.
FileInputStream and FileOutputStream on *nix use the read and write system calls internally. The documentation for write says that reads will see the results of past writes,
After a write() to a regular file has successfully returned:
Any successful read() from each byte position in the file that was
modified by that write shall return the data specified by the write()
for that position until such byte positions are again modified.
I'm pretty sure NTFS on Windows will have the same read()/write() guarantees.
You can't talk about a "happens-before" relationship in terms of the Java memory model between your FileInputStream and FileOutputStream objects, since they don't share any memory or thread. The VM is free to reorder them as long as it honors your synchronization requirements. When you have proper synchronization between reads and writes, without application-level buffering, you are safe.
However, FileInputStream and FileOutputStream share a file, which leaves things up to the OS; on mainstream operating systems you can expect a read issued after a write to see the written data.
If FileOutputStream does not override flush(), then I think you can be sure all data written by write() can be read by read(), unless your OS does something weird with the data (like starting a new thread that waits for the hard drive to spin at the right speed instead of blocking, etc) so that it is not written immediately.
No, you need to flush() the Streams (at least for Buffered(Input|Output)Streams), otherwise you could have data in a buffer.
Maybe you need a concurrent data structure?
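The point about application-level buffering can be checked with a small experiment. This sketch (the `FlushVisibility` class and `sizesAroundFlush` method are hypothetical names) writes a few bytes through a BufferedOutputStream and compares the file size the OS reports before and after flush(); it assumes the data is small enough to fit in the stream's internal buffer.

```java
import java.io.BufferedOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

final class FlushVisibility {

    /** Returns {file size before flush, file size after flush}. */
    static long[] sizesAroundFlush(Path file, byte[] data) throws IOException {
        try (OutputStream out = new BufferedOutputStream(new FileOutputStream(file.toFile()))) {
            out.write(data);                    // small writes stay in the Java-side buffer
            long before = Files.size(file);     // the OS has not seen the bytes yet
            out.flush();                        // hands the bytes to the OS
            long after = Files.size(file);
            return new long[] {before, after};
        }
    }
}
```

A plain FileOutputStream has no such Java-side buffer, which is consistent with the observation in the question that its flush() does nothing.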
I have a program that generates a lot of data and puts it in a queue to write, but the problem is that it's generating data faster than I'm currently writing it (causing it to max out memory and start to slow down). Order does not matter, as I plan to parse the file later.
I looked around a bit and found a few questions that helped me design my current process(but I still find it slow). Here's my code so far:
//...background multi-threaded process keeps building the queue..
FileWriter writer = new FileWriter("foo.txt",true);
BufferedWriter bufferWritter = new BufferedWriter(writer);
while(!queue_of_stuff_to_write.isEmpty()) {
String data = solutions.poll().data;
bufferWritter.newLine();
bufferWritter.write(data);
}
bufferWritter.close();
I'm pretty new to programming, so I may be assessing this wrong (it may be a hardware issue, as I'm using EC2), but is there a way to very quickly dump the queue results into a file? Or, if my approach is okay, can I improve it somehow? As order does not matter, does it make more sense to write to multiple files on multiple drives? Will threading make it faster? I'm not exactly sure of the best approach, and any suggestions would be great. My goal is to save the results of the queue (sorry, no outputting to /dev/null :-) and keep memory consumption as low as possible for my app (I'm not 100% sure, but the queue fills up 15 GB, so I'm assuming it'll be a 15 GB+ file).
Fastest way to write huge data in text file Java (realized I should use buffered writer)
Concurrent file write in Java on Windows (made me see that maybe multi-threading writes wasn't a great idea)
Looking at that code, one thing that springs to mind is character encoding. You're writing strings, but ultimately, it's bytes that go to the streams. A writer performs character-to-byte encoding under the hood, and it's doing it in the same thread that is handling writing. That may mean that time spent encoding is delaying writes, which could reduce the rate at which data is written.
A simple change would be to use a queue of byte[] instead of String, do the encoding in the threads which push onto the queue, and have the IO code use a BufferedOutputStream rather than a BufferedWriter.
This may also reduce memory consumption, if the encoded text takes up less than two bytes per character on average. For Latin text and UTF-8 encoding, this will usually be true.
However, I suspect it's likely that you're simply generating data faster than your IO subsystem can handle it. You will need to make your IO subsystem faster - either by using a faster one (if you're on EC2, perhaps renting a faster instance, or writing to a different backend - SQS vs EBS vs local disk, etc), or by ganging several IO subsystems together in parallel somehow.
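The byte[] queue suggestion above could look roughly like this. It is a sketch with made-up names (`EncodedQueueWriter`, `enqueue`, `drainTo`), it assumes UTF-8 output with one record per line, and it leaves out the threading that the real program would wrap around it.

```java
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Queue;
import java.util.concurrent.BlockingQueue;

final class EncodedQueueWriter {

    /** Producer side: encode before enqueueing, so the writer thread only moves bytes. */
    static void enqueue(BlockingQueue<byte[]> queue, String line) throws InterruptedException {
        queue.put((line + "\n").getBytes(StandardCharsets.UTF_8));
    }

    /** Writer side: appends everything currently queued to the file, returns the chunk count. */
    static int drainTo(Queue<byte[]> queue, Path file) throws IOException {
        int written = 0;
        try (OutputStream out = new BufferedOutputStream(
                Files.newOutputStream(file, StandardOpenOption.CREATE, StandardOpenOption.APPEND))) {
            byte[] chunk;
            while ((chunk = queue.poll()) != null) {
                out.write(chunk);               // no character encoding on this thread
                written++;
            }
        }
        return written;
    }
}
```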
Yes, writing multiple files on multiple drives should help, and if nothing else is writing to those drives at the same time, performance should scale linearly with the number of drives until I/O is no longer the bottleneck. You could also try a couple other optimizations to boost performance even more.
If you're generating huge files and the disk simply can't keep up, you can use a GZIPOutputStream to shrink the output--which, in turn, will reduce the amount of disk I/O. For non-random text, you can usually expect a compression ratio of at least 2x-10x.
//...background multi-threaded process keeps building the queue..
OutputStream out = new FileOutputStream("foo.txt",true);
OutputStreamWriter writer = new OutputStreamWriter(new GZIPOutputStream(out));
BufferedWriter bufferWriter = new BufferedWriter(writer);
while(!queue_of_stuff_to_write.isEmpty()) {
String data = solutions.poll().data;
bufferWriter.newLine();
bufferWriter.write(data);
}
bufferWriter.close();
If you're outputting regular (i.e., repetitive) data, you might also want to consider switching to a different output format--for example, a binary encoding of the data. Depending on the structure of your data, it might be more efficient to store it in a database. If you're outputting XML and really want to stick to XML, you should look into a Binary XML format, such as EXI or Fast InfoSet.
I guess as long as you produce your data out of calculations and do not load your data from another data source, writing will always be slower than generating your data.
You can try writing your data in multiple files (not in the same file -> due to synchronization problems) in multiple threads (but I guess that will not fix your problem).
Is it possible for you to wait for the writing part of your application to finish its operation and continue your calculations?
Another approach is:
Do you empty your queue? Does solutions.poll() reduce your solutions queue?
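One way to make the generating side wait for the writing side, as asked above, is a bounded queue. The sketch below is an assumption about how that could be wired up, not something taken from the question's code: `BoundedHandoff` and `produceAndConsume` are hypothetical names, and the consumer simply discards rows where the file write would go.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

final class BoundedHandoff {

    /** Produces `items` rows on one thread and consumes them on the caller's thread. */
    static int produceAndConsume(int items, int capacity) throws InterruptedException {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(capacity);
        Thread producer = new Thread(() -> {
            try {
                for (int i = 0; i < items; i++) {
                    queue.put("row " + i);      // blocks while the queue is full
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        producer.start();
        int consumed = 0;
        for (int i = 0; i < items; i++) {
            queue.take();                       // the file write would go here
            consumed++;
        }
        producer.join();
        return consumed;
    }
}
```

With this arrangement, memory use is capped by the queue capacity, and the generator slows down to the speed of the writer instead of filling the heap.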
Writing to different files using multiple threads is a good idea. Also, you should look into setting the BufferedWriter's buffer size, which you can do from the constructor. Try initializing it with a 10 MB buffer and see if that helps.
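A sketch of the buffer-size suggestion is shown below. `LargeBufferWriter` and `writeAll` are made-up names and UTF-8 is an assumed encoding; the second BufferedWriter constructor argument is the real one and is measured in chars, so `10 * 1024 * 1024` reserves roughly 20 MB of heap.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class LargeBufferWriter {

    /** Writes each line through a BufferedWriter with an explicitly sized buffer. */
    static void writeAll(Path file, Iterable<String> lines, int bufferChars) throws IOException {
        try (BufferedWriter writer = new BufferedWriter(
                new OutputStreamWriter(Files.newOutputStream(file), StandardCharsets.UTF_8),
                bufferChars)) {
            for (String line : lines) {
                writer.write(line);
                writer.newLine();
            }
        }   // close() flushes whatever is still buffered
    }
}
```

Whether a buffer this large actually helps depends on the disk and the OS cache, so it is worth measuring against the default.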