There are many java standard and 3rd party libraries that in their public API, there are methods for writing to or reading from Stream.
One example is javax.imageio.ImageIO.write() that takes OutputStream to write the content of a processed image to it.
Another example is iText pdf processing library that takes OutputStream to write the resulting pdf to it.
Third example is AmazonS3 Java API, which takes InputStream so that will read it and create file in thir S3 storage.
The problem araises when you want to to combine two of these. For example, I have an image as BufferedImage for which i have to use ImageIO.write to push the result in OutputStream.
But there is no direct way to push it to Amazon S3, as S3 requires InputStream.
There are few ways to work this out, but subject of this question is usage of ByteArrayOutputStream.
The idea behind ByteArrayOutputStream is to use an intermidiate byte array wrapped in Input/Output Stream so that the guy that wants to write to output stream will write to the array and the guy that wants to read, will read the array.
My wondering is why ByteArrayOutputStream does not allow any access to the byte array without copying it, for example, to provide an InputStream that has direct access to it.
The only way to access it is to call toByteArray(), that will make a copy of the internal array (the standard one). Which means, in my image example, i will have three copies of the image in the memory:
First is the actual BufferedImage,
second is the internal array of the OutputStream and
third is the copy produced by toByteArray() so I can create the
InputStream.
How this design is justified?
Hiding implementation? Just provide getInputStream(), and the implementation stays hidden.
Multi-threading? ByteArrayOutputStream is not suited for access by multiple threads anyway, so this can not be.
Moreover, there is second flavor of ByteArrayOutputStream, provided by Apache's commons-io library (which has a different internal implementation).
But both have exactly the same public interface that does not provide way to access the byte array without copying it.
My wondering is why ByteArrayOutputStream does not allow any access to the byte array without coping it, for example, to provide an InputStream that has direct access to it.
I can think of four reasons:
The current implementation uses a single byte array, but it could also be implemented as a linked list of byte arrays, deferring the creation of the final array until the application asks for it. If the application could see the actual byte buffer, it would have to be a single array.
Contrary to your understanding ByteArrayOutputStream is thread safe, and is suitable for use in multi-threaded applications. But if direct access was provided to the byte array, it is difficult to see how that could be synchronized without creating other problems.
The API would need to be more complicated because the application also needs to know where the current buffer high water mark is, and whether the byte array is (still) the live byte array. (The ByteArrayOutputStream implementation occasionally needs to reallocate the byte array ... and that will leave the application holding a reference to an array that is no longer the array.)
When you expose the byte array, you allow an application to modify the contents of the array, which could be problematic.
How this design is justified?
The design is tailored for simpler use-cases than yours. The Java SE class libraries don't aim to support all possible use-cases. But they don't prevent you (or a 3rd party library) from providing other stream classes for other use-cases.
The bottom line is that the Sun designers decided NOT to expose the byte array for ByteArrayOutputStream, and (IMO) you are unlikely to change their minds.
(And if you want to try, this is not the right place to do it.
Try submitting an RFE via the Bugs database.
Or develop an patch that adds the functionality and submit it to the OpenJDK team via the relevant channels. You would increase your chances if you included comprehensive unit tests and documentation.)
You might have more success convincing the Apache Commons IO developers of the rightness of your arguments, provided that you can come up with an API design that isn't too dangerous.
Alternatively, there's nothing stopping you from just implementing your own special purpose version that exposes its internal data structures. The code is GPL'ed so you can copy it ... subject to the normal GPL rules about code distribution.
Luckily, the internal array is protected, so you can subclass it, and wrap a ByteArrayInputStream around it, without any copying.
I think that the behavior you are looking for is a Pipe. A ByteArrayOutputStream is just an OutputStream, not an input/output stream. It wasn't designed for what you have in mind.
Related
We can get a BufferedInputStream by decorating an FileInputStream. And Channel got from FileInputStream.getChannel can also read content into a Buffer.
So, What's the difference between BufferedInputStream and java.nio.Buffer? i.e., when should I use BufferedInputStream and when should I use java.nio.Buffer and java.nio.Channel?
Getting started with new I/O (NIO), an article excerpt:
A stream-oriented I/O system deals with data one byte at a time. An
input stream produces one byte of data, and an output stream consumes
one byte of data. It is very easy to create filters for streamed data.
It is also relatively simply to chain several filters together so that
each one does its part in what amounts to a single, sophisticated
processing mechanism. On the flip side, stream-oriented I/O is often
rather slow.
A block-oriented I/O system deals with data in blocks. Each operation
produces or consumes a block of data in one step. Processing data by
the block can be much faster than processing it by the (streamed)
byte. But block-oriented I/O lacks some of the elegance and simplicity
of stream-oriented I/O.
These classes where written at different times for different packages.
If you are working with classes in the java.io package use BufferedInputStream.
If you are using java.nio use the ByteBuffer.
If are not using either you could use a plain byte[]. ByteBuffer has some useful methods for working with primitives so you might use it for that.
It is unlikely there will be any confusion because in general you will only use one when you have to and in which case only one will compile when you do.
I think we use BufferedInputStream to wrap the InputStream to make it works like block-oriented. But when deal with too much data, it actually consume more time than the real block-oriented I/O (Channel), but still faster than the unwrapperd InputStream.
I am getting a hard time visualizing what exactly stream means in terms of IO. I imagine stream as a continuous flow of data coming from a file, socket or any other data source. Is that correct? But then I get confused on how our java programs react to stream because when we write any java code let say:
Customer getCustomer(Customer customer)
Doesn't the above java code expects the whole object to be present before it gets processed?
Now lets say we are reading from a stream something like
FileInputStream in = new FileInputStream("abc.txt")
in.read();
Doesn't in.read() expects the whole file to be present in memory to be processed. If it is, then how come it is a stream? Why do we call them streams? Do they process data as it is read?
A similar confusion when reading through hadoop streams, looks like they have a different meaning altogether.
The word "stream" is used for different things in different contexts. But you're specifically asking about streams in I/O, i.e. InputStream and OutputStream.
I imagine stream as a continuous flow of data coming from a file, socket or any other data source. Is that correct?
Yes, a stream is a source of a sequence of bytes, which may come from a file, socket etc.
About getCustomer: You need to have a Customer object to pass to that method. But calling methods, passing objects and getting objects returned really does not have anything to do with streams.
Doesn't in.read() expects the whole file to be present in memory to be processed.
No. FileInputStream is an object which represents the stream. It's the thing that knows how to read bytes from the file.
Streams are not a fundamentally special kind of object. It's not like that there are classes, objects and streams. Streams are just a concept that is implemented using the standard Java OO programming features (classes and objects).
Doesn't the above java code expects the whole object to be present before it gets processed?
Yes. But it's not a stream.
Doesn't in.read() expects the whole file to be present in memory to be processed.
No.
If it is
It isn't.
then how come it is a stream?
It is.
Why do we call them streams? Do they process data as it is read?
Yes.
A similar confusion
There is no confusion here, except your own confusion when comparing method calls with I/O streams, which comparing apples versus oranges.
when reading through hadoop streams, looks like they have a different meaning altogether.
Very possibly.
An I/O Stream represents an input source or an output destination. A stream can represent many different kinds of sources and destinations, including disk files, devices, other programs, and memory arrays.
Streams support many different kinds of data, including simple bytes, primitive data types, localized characters, and objects. Some streams simply pass on data; others manipulate and transform the data in useful ways.
No matter how they work internally, all streams present the same simple model to programs that use them: A stream is a sequence of data.
In Java there are two kinds of streams (Byte & Character), they differ in the way how the data is transferred between source & destination.
Hope this answers your question. Please let me know if you need any further information.
From here..
A Stream is a free flowing sequence of elements. They do not hold any storage as that responsibility lies with collections such as arrays, lists and sets. Every stream starts with a source of data, sets up a pipeline, processes the elements through a pipeline and finishes with a terminal operation. They allow us to parallelize the load that comes with heavy operations without having to write any parallel code. A new package java.util.stream was introduced in Java 8 to deal with this feature.
Streams adhere to a common Pipes and Filters software pattern. A pipeline is created with data events flowing through, with various intermediate operations being applied on individual events as they move through the pipeline. The stream is said to be terminated when the pipeline is disrupted with a terminal operation. Please keep in mind that a stream is expected to be immutable — any attempts to modify the collection during the pipeline will raise a ConcurrentModifiedException exception.
Suppose I'm using a Deflater to compress a stream of bytes, and at some intervals I have the option of feeding it with two different byte arrays (two alternative representations of the same info), so that I can choose the most compressible one. Ideally, I would like to be able to clone the state of a "live" deflater, so that I can feed each instance with an array, check the results, and discard the undesirable one.
Alternatively, I'd like to mark the current state (sort of a savepoint) so that, after feeding and compressing with setInput() + deflate() I can rollback/reset to that state to try with different data.
Looking at the API, this seems to me rather impossible... nor even reimplementing the Deflater (not at least if one wants to take advantage of the internal native implementation). Am I right? Any ideas or experiences?
It does not appear that the Java interface to zlib provides zlib's deflateCopy() operation. It is possible that the inherited clone operation is properly implemented and does a deflateCopy(), but I don't know.
What is the difference between a byte array & byte buffer ?
Also, in what situations should one be preferred over the other?
[my usecase is for a web application being developed in java].
There are actually a number of ways to work with bytes. And I agree that it's not always easy to pick the best one:
the byte[]
the java.nio.ByteBuffer
the java.io.ByteArrayOutputStream (in combination with other streams)
the java.util.BitSet
The byte[] is just a primitive array, just containing the raw data. So, it does not have convenient methods for building or manipulating the content.
A ByteBuffer is more like a builder. It creates a byte[]. Unlike arrays, it has more convenient helper methods. (e.g. the append(byte) method). It's not that straightforward in terms of usage. (Most tutorials are way too complicated or of poor quality, but this one will get you somewhere. Take it one step further? then read about the many pitfalls.)
You could be tempted to say that a ByteBuffer does to byte[], what a StringBuilder does for String. But there is a specific difference/shortcoming of the ByteBuffer class. Although it may appear that a bytebuffer resizes automatically while you add elements, the ByteBuffer actually has a fixed capacity. When you instantiate it, you already have to specify the maximum size of the buffer.
That's one of the reasons, why I often prefer to use the ByteArrayOutputStream because it automatically resizes, just like an ArrayList does. (It has a toByteArray() method). Sometimes it's practical, to wrap it in a DataOutputStream. The advantage is that you will have some additional convenience calls, (e.g. writeShort(int) if you need to write 2 bytes.)
BitSet comes in handy when you want to perform bit-level operations. You can get/set individual bits, and it has logical operator methods like xor(). (The toByteArray() method was only introduced in java 7.)
Of course depending on your needs you can combine all of them to build your byte[].
ByteBuffer is part of the new IO package (nio) that was developed for fast throughput of file-based data. Specifically, Apache is a very fast web server (written in C) because it reads bytes from disk and puts them on the network directly, without shuffling them through various buffers. It does this through memory-mapped files, which early versions of Java did not have. With the advent of nio, it became possible to write a web server in java that is as fast as Apache. When you want very fast file-to-network throughput, then you want to use memory mapped files and ByteBuffer.
Databases typically use memory-mapped files, but this type of usage is seldom efficient in Java. In C/C++, it's possible to load up a large chunk of memory and cast it to the typed data you want. Due to Java's security model, this isn't generally feasible, because you can only convert to certain native types, and these conversions aren't very efficient. ByteBuffer works best when you are just dealing with bytes as plain byte data -- once you need to convert them to objects, the other java io classes typically perform better and are easier to use.
If you're not dealing with memory mapped files, then you don't really need to bother with ByteBuffer -- you'd normally use arrays of byte. If you're trying to build a web server, with the fastest possible throughput of file-based raw byte data, then ByteBuffer (specifically MappedByteBuffer) is your best friend.
Those two articles may help you http://nadeausoftware.com/articles/2008/02/java_tip_how_read_files_quickly and http://evanjones.ca/software/java-bytebuffers.html
Learning Java, so be gentle please. Ideally I need to create an array of bytes that will point to a portion of a bigger array:
byte[] big = new byte[1000];
// C-style code starts
load(file,big);
byte[100] sub = big + 200;
// C-style code ends
I know this is not possible in Java and there are two work-arounds that come to mind and would include:
Either copying portion of big into sub iterating through big.
Or writting own class that will take a reference to big + offset + size and implementing the "subarray" through accessor methods using big as the actual underlying data structure.
The task I am trying to solve is to load a file into memory an then gain read-only access to the records stored withing the file through a class. The speed is paramount, hence ideally I'd like to avoid copying or accessor methods. And since I'm learning Java, I'd like to stick with it.
Any other alternatives I've got? Please do ask questions if I didn't explain the task well enough.
Creating an array as a "view" of an other array is not possible in Java. But you could use java.nio.ByteBuffer, which is basically the class you suggest in work-around #2. For instance:
ByteBuffer subBuf = ByteBuffer.wrap(big, 200, 100).slice().asReadOnlyBuffer();
No copying involved (some object creation, though). As a standard library class, I'd also assume that ByteBuffer is more likely to receive special treatment wrt. "JIT" optimizations by the JVM than a custom one.
If you want to read a file fast and with low-level access, check the java nio stuff. Here's an example from java almanac.
You can use a mapped byte buffer to navigate within the file content.
Take a look at the source for java.lang.String (it'll be in the src.zip or src.jar). You will see that they have a an array of cahrs and then an start and end. So, yes, the solution is to use a class to do it.
Here is a link to the source online.
The variables of interest are:
value
offset
count
substring is probably a good method to look at as a starting point.
If you want to read directlry from the file make use of the java.nio.channels.FileChannel class, specifically the map() method - that will let you use memory mapped I/O which will be very fast and use less memory than the copying to arrays.