How to clone input stream but still re-use original - java

I am trying to copy the InputStream from a URLConnection, which returns a stream of type HttpInputStream (an inner class of HttpURLConnection).
In other cases, I can copy the original stream to a ByteArrayOutputStream and then use mark/reset on the original, but HttpInputStream does not support mark/reset.
Is there a way I can still copy the stream and reset the original or keep it from being consumed? The original stream inside URLConnection has to be readable because it is passed into another library. I only need to copy the stream so I can read the first two lines of data. Here is what I have for streams that support mark/reset:
InputStream input = null;
ByteArrayOutputStream baos = new ByteArrayOutputStream();
try {
input = connection.getInputStream();
byte[] buffer = new byte[200];
input.mark(200);
int len = input.read(buffer);
input.reset();
baos.write(buffer, 0, len);
baos.flush();
String content = baos.toString("UTF-8");
//I set flags based on the value of content, but omitting here for the sake of simplicity.
} catch (IOException ex) {
//I do stuff here, but omitting for sake of simplicity in this
}

InputStreams are not generally cloneable, and not all streams support mark/reset. There are some possible workarounds within the standard JRE.
Wrap the InputStream into a BufferedInputStream. That one supports mark/reset within the limits of its buffer size. That enables you to read a limited amount of data from the beginning, then reset the stream.
Another alternative is PushbackInputStream, which allows you to "unread" data previously read. You need to buffer the data to be pushed back yourself, though, so it may be a bit inconvenient to handle.
If the whole stream isn't terribly big, you could also read the entire stream first, then construct as many ByteArrayInputStreams as needed from the pre-read data. This is only feasible if the data fits in the heap and in a single byte array (just under 2GB at most).
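The BufferedInputStream approach could look roughly like the sketch below. The class name StreamPeek and the method peek are made up for illustration, and a ByteArrayInputStream stands in for connection.getInputStream(). After wrapping, the other library should receive the BufferedInputStream rather than the raw stream.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of any library.
final class StreamPeek {
    /** Reads up to limit bytes from the current position, then rewinds the stream. */
    static String peek(BufferedInputStream in, int limit) throws IOException {
        in.mark(limit);                  // remember the current position
        byte[] head = new byte[limit];
        int total = 0;
        // read() may return fewer bytes than requested, so loop until full or EOF
        while (total < limit) {
            int n = in.read(head, total, limit - total);
            if (n == -1) {
                break;
            }
            total += n;
        }
        in.reset();                      // rewind to the marked position
        return new String(head, 0, total, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for connection.getInputStream()
        InputStream raw = new ByteArrayInputStream(
                "line one\nline two\nrest of body".getBytes(StandardCharsets.UTF_8));
        BufferedInputStream buffered = new BufferedInputStream(raw);
        System.out.println(peek(buffered, 200));
        // From here on, pass 'buffered' (not 'raw') to the other library.
    }
}
```

Because the mark is set with a read limit equal to the number of bytes peeked, the reset stays valid no matter how the underlying stream delivers its data.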

The Apache Commons IO library has a really nice TeeInputStream.
https://commons.apache.org/proper/commons-io/javadocs/api-1.4/org/apache/commons/io/input/TeeInputStream.html

Related

Why to use ByteArrayInputStream rather than byte[] in Java

As I understand it, ByteArrayInputStream is used to read byte[] data.
Why should I use it rather than a plain byte[] (for example, when reading the data from a DB)?
What is the difference between them?
If the input is always a byte[], then you're right, there's often no need for the stream. And if you don't need it, don't use it. One additional advantage of a ByteArrayInputStream is that it serves as a very strong indication that you intend the bytes to be read-only (since the stream doesn't provide an interface for changing them), though it's important to note that a programmer can often still access the bytes directly, so you shouldn't use that in a situation where security is a concern.
But if it's sometimes a byte[], sometimes a file, sometimes a network connection, etc, then you need some sort of abstraction for "a stream of bytes, and I don't care where they come from." That's what an InputStream is. When the source happens to be a byte array, ByteArrayInputStream is a good InputStream to use.
This is helpful in many situations, but to give two concrete examples:
You're writing a library that takes bytes and processes them somehow (maybe it's an image processing library, for instance). Users of your library may supply bytes from a file, or from a byte[] in memory, or from some other source. So, you provide an interface that accepts an InputStream — which means that if what they have is a byte[], they need to wrap it in a ByteArrayInputStream.
You're writing code that reads a network connection. But to unit test that code, you don't want to have to open up a connection; you want to just supply some bytes in the code. So the code takes an InputStream, and your test provides a ByteArrayInputStream.
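As a minimal sketch of the second example, the code under test below accepts any InputStream, and the caller feeds it an in-memory array. The class name LineCounter and the method countLines are invented for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical example class, not from the original answers.
final class LineCounter {
    /** Counts newline bytes from any source: file, socket, or in-memory array. */
    static int countLines(InputStream in) throws IOException {
        int count = 0;
        int b;
        while ((b = in.read()) != -1) {
            if (b == '\n') {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        // In a unit test, canned bytes replace the real network connection.
        byte[] canned = "a\nb\nc\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(countLines(new ByteArrayInputStream(canned)));
    }
}
```

In production the same method could be handed a socket's or file's input stream without any change.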
A ByteArrayInputStream contains an internal buffer that contains bytes that may be read from the stream. An internal counter keeps track of the next byte to be supplied by the read method.
ByteArrayInputStream acts like a read-only view of the underlying array: the stream offers no way to modify it. Note, however, that the array is not copied, so code holding the original reference can still change it.
It provides higher-level read, mark, and skip operations.
A stream also has the advantage that you don't have to have all bytes in memory at the same time, which is convenient if the size of the data is large and can easily be handled in small chunks.
Reference doc
Whereas if you choose byte[], you have to reinvent the wheel: do the reading and skipping yourself and track the current index explicitly.
byte data[] = { 65, 66, 67, 68, 69 }; // ASCII codes for A..E
// Plain array: you track the index yourself.
for (int index = 0; index < data.length; index++) {
    System.out.print((char) data[index] + " ");
}
// Stream: read() returns the next byte, or -1 at the end.
int c;
ByteArrayInputStream bInput = new ByteArrayInputStream(data);
while ((c = bInput.read()) != -1) {
    System.out.println(Character.toUpperCase((char) c));
}
ByteArrayInputStream is a good wrapper for byte[]. The core is understanding streams: a stream is an ordered sequence of bytes of indeterminate length. Input streams move bytes of data into a Java program from some generally external source. In java.io, you can decorate one stream with another to get more functionality, though performance may suffer. The power of the stream metaphor is that the differences between these sources and destinations are abstracted away: all input and output operations are simply treated as streams using the same classes and the same methods. You do not need to learn a new API for every different kind of device; the same API that reads files can read network sockets, serial ports, Bluetooth transmissions, and more.

Reading a file larger than 2GB into memory in Java

Since ByteArrayInputStream is limited to 2GB, is there any alternate solution that allows me to store the whole contents of a 2.3GB (and possibly larger) file into an InputStream to be read by Stax2?
Current code:
XMLInputFactory xmlInputFactory = XMLInputFactory.newInstance();
XMLStreamReader xmlStreamReader = xmlInputFactory.createXMLStreamReader(in); //ByteArrayInputStream????
try
{
SchemaFactory factory = SchemaFactory.newInstance("http://www.w3.org/2001/XMLSchema");
Schema schema = factory.newSchema(new StreamSource(schemaInputStream));
Validator validator = schema.newValidator();
validator.validate(new StAXSource(xmlStreamReader));
}
finally
{
xmlStreamReader.close();
}
For performance tuning, the variable in must not come from disk. I have plenty of RAM.
The whole point of StAX2 is that you do not need to read the file into memory. You can just supply the source, and let the StAX StreamReader pull the data as it needs to.
What additional constraints do you have that you are not showing in your question?
If you have lots of memory, and you want to get good performance, just wrap your InputStream with a large byte buffer, and let the buffer do the buffering for you:
// 4 meg buffer on the stream
InputStream buffered = new BufferedInputStream(schemaInputStream, 1024 * 1024 * 4);
An alternative to solving this in Java is to create a RAMDisk, and to store the file on that, which would remove the problem from Java, where your basic limitation is that you can only have just less than Integer.MAX_VALUE values in a single array.
Use NIO to read the file into a gigantic ByteBuffer, and then create a stream class that reads from the ByteBuffer. Several such classes are floating around in open-source projects.
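A stream class of that kind might be sketched as follows. The name ByteBufferInputStream is made up here and does not refer to any particular open-source class. Since a single ByteBuffer cannot hold more than Integer.MAX_VALUE bytes, this sketch assumes the file has been loaded into several buffers and reads them back to back.

```java
import java.io.InputStream;
import java.nio.ByteBuffer;

// Hypothetical class: an InputStream view over one or more ByteBuffers.
final class ByteBufferInputStream extends InputStream {
    private final ByteBuffer[] chunks;
    private int current = 0;

    ByteBufferInputStream(ByteBuffer... chunks) {
        this.chunks = chunks;
    }

    /** Skips exhausted chunks; returns false once all data has been read. */
    private boolean advance() {
        while (current < chunks.length && !chunks[current].hasRemaining()) {
            current++;
        }
        return current < chunks.length;
    }

    @Override
    public int read() {
        if (!advance()) {
            return -1;
        }
        return chunks[current].get() & 0xFF;   // convert signed byte to 0..255
    }

    @Override
    public int read(byte[] dst, int off, int len) {
        if (len == 0) {
            return 0;
        }
        if (!advance()) {
            return -1;
        }
        // Serve at most the remainder of the current chunk per call.
        int n = Math.min(len, chunks[current].remaining());
        chunks[current].get(dst, off, n);
        return n;
    }

    public static void main(String[] args) {
        InputStream in = new ByteBufferInputStream(
                ByteBuffer.wrap(new byte[] { 'a', 'b' }),
                ByteBuffer.wrap(new byte[] { 'c' }));
        int b;
        while ((b = in.read()) != -1) {
            System.out.print((char) b);
        }
        System.out.println();
    }
}
```

The resulting stream can be passed to createXMLStreamReader like any other InputStream.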
If you have huge quantities of memory, you really won't get any performance improvement anyway. It's only getting read in once either way, and the disk cache will ensure it gets done optimally. Just use a disk-based input stream.
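You can save memory by writing the data compressed into a ByteArrayOutputStream:
ByteArrayOutputStream baos = new ByteArrayOutputStream();
try (OutputStream gz = new GZIPOutputStream(baos)) {
    // ... copy the file contents into gz ...
}
byte[] bytes = baos.toByteArray(); // < 100 MB?
InputStream in = new GZIPInputStream(new ByteArrayInputStream(bytes));
Reading through the GZIPInputStream later decompresses the data on the fly.
This is still a minor slowdown, but XML compresses well, so it should be ideal here.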

Java file IO truncated while reading large files using BufferedInputStream

I have a function in which I am only given a BufferedInputStream and no other information about the file to be read. I unfortunately cannot alter the method definition as it is called by code I don't have access to. I've been using the code below to read the file and place its contents in a String:
public String[] doImport(BufferedInputStream stream) throws IOException, PersistenceException {
int bytesAvail = stream.available();
byte[] bytesRead = new byte[bytesAvail];
stream.read(bytesRead);
stream.close();
String fileContents = new String(bytesRead);
//more code here working with fileContents
}
My problem is that for large files (>2Gb), this code causes the program to either run extremely slowly or truncate the data, depending on the computer the program is executed on. Does anyone have a recommendation regarding how to deal with large files in this situation?
You're assuming that available() returns the size of the file; it does not. It returns the number of bytes available to be read, and that may be any number less than or equal to the size of the file.
Unfortunately there's no way to do what you want in just one shot without having some other source of information on the length of the file data (i.e., by calling java.io.File.length()). Instead, you have to possibly accumulate from multiple reads. One way is by using ByteArrayOutputStream. Read into a fixed, finite-size array, then write the data you read into a ByteArrayOutputStream. At the end, pull the byte array out. You'll need to use the three-argument forms of read() and write() and look at the return value of read() so you know exactly how many bytes were read into the buffer on each call.
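The accumulation loop described above could be sketched like this. The class name ReadAll and the method readFully are illustrative. Note that the result is still a single byte array, so this does not lift the 2GB limit; data larger than that has to be processed chunk by chunk inside the loop instead of being collected.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper illustrating the read/write loop.
final class ReadAll {
    /** Drains the stream into a byte array using a fixed-size buffer. */
    static byte[] readFully(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        int n;
        // read() reports how many bytes it actually delivered; -1 means end of stream
        while ((n = in.read(buf, 0, buf.length)) != -1) {
            out.write(buf, 0, n);   // write only the bytes that were read
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] all = readFully(new ByteArrayInputStream(new byte[20000]));
        System.out.println(all.length);
    }
}
```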
I'm not sure why you don't think you can read it line-by-line. BufferedInputStream only describes how the underlying stream is accessed, it doesn't impose any restrictions on how you ultimately read data from it. You can use it just as if it were any other InputStream.
Namely, to read it line-by-line you could do
InputStreamReader streamReader = new InputStreamReader(stream);
BufferedReader lineReader = new BufferedReader(streamReader);
String line = lineReader.readLine();
...
[Edit] This response is to the original wording of the question, which asked specifically for a way to read the input file line-by-line.

How to initialize a ByteBuffer if you don't know how many bytes to allocate beforehand?

Is this:
ByteBuffer buf = ByteBuffer.allocate(1000);
...the only way to initialize a ByteBuffer?
What if I have no idea how many bytes I need to allocate..?
Edit: More details:
I'm converting one image file format to a TIFF file. The problem is the starting file format can be any size, but I need to write the data in the TIFF to little endian. So I'm reading the stuff I'm eventually going to print to the TIFF file into the ByteBuffer first so I can put everything in Little Endian, then I'm going to write it to the outfile. I guess since I know how long IFDs are, headers are, and I can probably figure out how many bytes in each image plane, I can just use multiple ByteBuffers during this whole process.
The types of places that you would use a ByteBuffer are generally the types of places that you would otherwise use a byte array (which also has a fixed size). With synchronous I/O you often use byte arrays, with asynchronous I/O, ByteBuffers are used instead.
If you need to read an unknown amount of data using a ByteBuffer, consider using a loop with your buffer and append the data to a ByteArrayOutputStream as you read it. When you are finished, call toByteArray() to get the final byte array.
Any time when you aren't absolutely sure of the size (or maximum size) of a given input, reading in a loop (possibly using a ByteArrayOutputStream, but otherwise just processing the data as a stream, as it is read) is the only way to handle it. Without some sort of loop, any remaining data will of course be lost.
For example:
final byte[] buf = new byte[4096];
int numRead;
// Use try-with-resources to auto-close streams.
try (
    final FileInputStream fis = new FileInputStream(...);
    final ByteArrayOutputStream baos = new ByteArrayOutputStream()
) {
    // read() returns -1 at end of stream.
    while ((numRead = fis.read(buf)) != -1) {
        baos.write(buf, 0, numRead);
    }
    final byte[] allBytes = baos.toByteArray();
    // Do something with the data.
}
catch (final IOException e) {
    // Do something on failure...
}
If you instead wanted to write Java ints, or other things that aren't raw bytes, you can wrap your ByteArrayOutputStream in a DataOutputStream:
ByteArrayOutputStream baos = new ByteArrayOutputStream();
DataOutputStream dos = new DataOutputStream(baos);
while (thereAreMoreIntsFromSomewhere()) {
int someInt = getIntFromSomewhere();
dos.writeInt(someInt);
}
byte[] allBytes = baos.toByteArray();
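One caveat for the TIFF use case in the question: DataOutputStream always writes multi-byte values in big-endian order. A rough sketch of a little-endian variant, assuming a small fixed-size ByteBuffer is used as scratch space and the output still grows in a ByteArrayOutputStream, is shown below. The class name LittleEndianWriter and the method encode are invented for this example.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper: encodes ints as little-endian bytes of unknown total size.
final class LittleEndianWriter {
    static byte[] encode(int[] values) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // Small reusable buffer; only the output stream grows.
        ByteBuffer scratch = ByteBuffer.allocate(Integer.BYTES).order(ByteOrder.LITTLE_ENDIAN);
        for (int v : values) {
            scratch.clear();
            scratch.putInt(v);                          // least significant byte first
            out.write(scratch.array(), 0, Integer.BYTES);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        for (byte b : encode(new int[] { 0x0A0B0C0D })) {
            System.out.printf("%02X ", b);
        }
        System.out.println();
    }
}
```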
Depends.
Library
Converting file formats tends to be a solved problem for most problem domains. For example:
Batik can transcode between various image formats (including TIFF).
Apache POI can convert between office spreadsheet formats.
Flexmark can generate HTML from Markdown.
The list is long. The first question should be, "What library can accomplish this task?" If performance is a consideration, your time is likely better spent optimising an existing package to meet your needs than writing yet another tool. (As a bonus, other people get to benefit from the centralised work.)
Known Quantities
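Reading a file? Allocate file.length() bytes.
Copying a string? Allocate as many bytes as its encoded form needs, e.g. string.getBytes(charset).length (string.length() counts chars, not bytes).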
Copying a TCP packet? Allocate 1500 bytes, for example.
Unknown Quantities
When the number of bytes is truly unknown, you can do a few things:
Make a guess.
Analyze example data sets to buffer; use the average length.
Example
Java's StringBuffer, unless otherwise instructed, uses an initial buffer size to hold 16 characters. Once the 16 characters are filled, a new, longer array is allocated, and then the original 16 characters copied. If the StringBuffer had an initial size of 1024 characters, then the reallocation would not happen as early or as often.
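The growth behaviour can be observed with capacity(). The sketch below is only a demonstration; the class name CapacityDemo is made up, and the exact capacity after growth is an implementation detail, so the code reports it rather than assuming a value.

```java
// Hypothetical demo class for observing StringBuffer reallocation.
final class CapacityDemo {
    /** Returns {default capacity, capacity after overflow, capacity of a presized buffer}. */
    static int[] capacities() {
        String seventeenChars = "12345678901234567";

        StringBuffer small = new StringBuffer();      // default capacity: 16 characters
        int initial = small.capacity();
        small.append(seventeenChars);                 // one char too many, forces a reallocation
        int grown = small.capacity();

        StringBuffer large = new StringBuffer(1024);  // presized, so no reallocation yet
        large.append(seventeenChars);

        return new int[] { initial, grown, large.capacity() };
    }

    public static void main(String[] args) {
        int[] c = capacities();
        System.out.println("default=" + c[0] + " grown=" + c[1] + " presized=" + c[2]);
    }
}
```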
Optimization
Either way, this is probably a premature optimization. Typically you would allocate a set number of bytes when you want to reduce the number of internal memory reallocations that get executed.
It is unlikely that this will be the application's bottleneck.
The idea is that it's only a buffer - not the whole of the data. It's a temporary resting spot for data as you read a chunk, process it (possibly writing it somewhere else). So, allocate yourself a big enough "chunk" and it normally won't be a problem.
What problem are you anticipating?

Writing large strings with DataOutputStream

I've been doing some socket programming to transmit information across the wire. I've run into a problem with DataOutputStream.writeUTF(). It seems to allow strings of up to 64k but I have a few situations where I can run over this. Are there any good alternatives that support larger strings or do I need to roll my own?
It actually uses two bytes to write the length of the encoded string (which is where the limit of 65535 bytes comes from) before using an algorithm that compacts it into one, two or three bytes per character. (See the documentation on java.io.DataOutput.) It is close to UTF-8, but not identical (the documentation calls it modified UTF-8), so there are compatibility problems. If you are not terribly worried about the amount of data you will be writing, you can easily write your own by writing the length of the string first, and then the raw data of the string using the getBytes method.
// Write data
String str="foo";
byte[] data=str.getBytes("UTF-8");
out.writeInt(data.length);
out.write(data);
// Read data
int length=in.readInt();
byte[] data=new byte[length];
in.readFully(data);
String str=new String(data,"UTF-8");
ObjectOutputStream.writeObject() properly handles long strings (verified by looking at the source code). Write the string out this way:
ObjectOutputStream oos = new ObjectOutputStream(out);
... other write operations ...
oos.writeObject(myString);
... other write operations ...
Read it this way:
ObjectInputStream ois = new ObjectInputStream(in);
... other read operations ...
String myString = (String) ois.readObject();
... other read operations ...
Another difference from DataOutputStream is that ObjectOutputStream automatically writes a 4-byte stream header when instantiated, but it's usually going to be a pretty small penalty to pay.
You should be able to use OutputStreamWriter with the UTF-8 encoding. There's no explicit writeUTF method, but you can set the charset in the constructor. Try
Writer osw = new OutputStreamWriter(out, "UTF-8");
where out is whatever OutputStream you're wrapping now.
