I'm writing a program that takes in a byte array of potentially millions of bytes and reads each one from a ByteArrayInputStream. If a byte is not "printable" (ASCII 32-126), it is encoded in a certain way and written to a ByteArrayOutputStream instance; if the byte is "printable", it is written directly to that same ByteArrayOutputStream instance.
So from a broader view I am taking in a byte array, and getting back a similar byte array except certain characters have been encoded.
My question is: would it be faster to write my data out to a file, or to keep writing to this OutputStream?
It will be faster to write the data to your output stream. Writing to a file will involve disk access, which is slower than access to the RAM where the byte array inside the ByteArrayOutputStream lives.
However, if you eventually want to write your byte array out to some other place (say a file) then the intermediate step of the ByteArrayOutputStream is unnecessary and you should just write straight to the end destination e.g. FileOutputStream.
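For illustration, here is a minimal sketch of such a filter loop, written against a plain OutputStream so the destination can be a ByteArrayOutputStream or a FileOutputStream; the "%XX" hex escape is just a placeholder for whatever encoding scheme you actually use:

import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

public static void encodeTo(byte[] input, OutputStream out) throws IOException {
    for (byte b : input) {
        int v = b & 0xFF; // treat the byte as unsigned
        if (v >= 32 && v <= 126) {
            out.write(v); // printable ASCII: pass straight through
        } else {
            // placeholder escape: "%" followed by two hex digits
            out.write(String.format("%%%02X", v).getBytes(StandardCharsets.US_ASCII));
        }
    }
}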
Can you explain one thing? When I do something like this:
FileInputStream fis1 = new FileInputStream(path1);
FileInputStream fis2 = new FileInputStream(path2);
byte[] array = new byte[fis1.available() + fis2.available()];
And if I want to write bytes to the array:
fis2.read(array);
fis1.read(array);
What will the read() method do? Will it write ALL the bytes from both streams into the array, or not?
Which bytes end up in the array, and in what order? I didn't find this in the spec or the docs.
The read(byte[] b) method javadoc says:
Reads up to b.length bytes of data from this input stream into an array of bytes. This method blocks until some input is available.
Returns: the total number of bytes read into the buffer, or -1 if there is no more data because the end of the file has been reached.
What it means is it reads "some" number of bytes into the beginning of the array.
How many bytes does it read? The method returns the number of bytes it read. It reads at most the full length of the array, but it will likely be an amount in the range of a few kilobytes at most. The exact details depend on the operating system and file system implementation.
It does not read all bytes from the file, and it does not guarantee the byte array is filled entirely. If you call it twice, it does not return the same data twice.
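If you actually want the array filled completely, you have to loop (DataInputStream.readFully() performs exactly this loop for you). A minimal sketch, assuming a generic InputStream:

import java.io.IOException;
import java.io.InputStream;

// Fills 'array' from 'in', looping until every slot is written
// or the stream ends early.
static void readExactly(InputStream in, byte[] array) throws IOException {
    int offset = 0;
    while (offset < array.length) {
        int n = in.read(array, offset, array.length - offset);
        if (n == -1) {
            throw new IOException("Stream ended after " + offset + " bytes");
        }
        offset += n;
    }
}

To load two files into one array, as in your example, you would read each stream into its own region of the array (carrying the offset forward), rather than calling read(array) once per stream; as written, your second call simply overwrites bytes at the start of the array with bytes from the other file.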
I understand what a buffer is when writing to a file: making an OS-level write (one native API call per char) is costly, so many chars/bytes are collected in a buffer, and the buffer is written to the file with one OS API call.
But what buffer is meant here? And why?
ByteArrayOutputStream(int size) - Creates a new byte array output stream, with a buffer capacity of the specified size, in bytes.
ByteArrayOutputStream() has a 32-byte buffer by default, which is why Apache Commons has a class of exactly the same name, org.apache.commons.io.output.ByteArrayOutputStream, differing only in buffer size and growth mechanism: "The original implementation only allocates 32 bytes at the beginning. As this class is designed for heavy duty it starts at 1024 bytes. In contrast to the original it doesn't reallocate the whole memory block but allocates additional buffers. This way no buffers need to be garbage collected and the contents don't have to be copied to the new buffer. This class is designed to behave exactly like the original."
Besides, in ByteArrayInputStream(byte[] buf), as I understand it, "buf" (the buffer) is actually the source of the data (bytes) to be fed into the InputStream (ByteArrayInputStream emulates an InputStream over a byte array), so the word "buffer" there is confusing, in my opinion.
This class implements an output stream in which the data is written into a byte array. The buffer automatically grows as data is written to it.
The two terms, byte array and buffer, are synonymous here. The buffer is the byte[] array that holds the bytes written to the stream.
The buffer size is analogous to the capacity of an ArrayList. If you write more than 32 bytes to the stream then it has to grow the buffer, which involves allocating a new array and copying the bytes from old to new. A default "capacity" of 32 is inefficient if you know you'll be writing more than that.
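A tiny sketch of the difference (the 1 MB figures are made up for illustration):

import java.io.ByteArrayOutputStream;

// Presized: no intermediate grow-and-copy cycles.
ByteArrayOutputStream baos = new ByteArrayOutputStream(1 << 20); // ~1 MB
for (int i = 0; i < 1_000_000; i++) {
    baos.write(i & 0xFF);
}
byte[] result = baos.toByteArray(); // one final copy
// With the no-arg constructor, the same loop would have grown the
// internal buffer from 32 bytes through many reallocations.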
The Javadoc says:
This class implements an output stream in which the data is written into a byte array. The buffer automatically grows as data is written to it.
So in the space of two sentences, it has used two different terms. There are numerous other examples in the same doc.
On the one hand, this might be confusing if you don't know that they are referring to the same thing; it might be clearer if it said something like:
is written to a buffer, implemented as a byte array.
But once you know (or assume, since this is quite a common convention) that they refer to the same thing, it's no longer especially confusing.
I have a project where we write a small amount of data to a file every 5 minutes. The idea is to look at how this data changes over a period of hours, days, and weeks.
One of the requirements is to store this data in a secure format. We already have an encryption scheme for sending this data across a network as a byte[] array via DataInputStream/DataOutputStream.
The question I have is this, is there a way to write encrypted byte[] arrays to a text file in such a way that I can read them back out? My biggest problem at the moment is that I'm reading Strings from the files, which messes up the byte[] arrays.
Any thoughts or pointers on where to go?
What you need to do is take your data and put it into a byte array. Then once it is in a byte array, you can encrypt it using an encryption algorithm. Then you write it to the file.
When you want to get the original data back, you have to read the byte array from the file and then decrypt it; then you will have your original data. You cannot just read this data as a String, because the encryption algorithm will produce bytes that cannot be represented as regular chars, so your data will get corrupted.
Just make sure you read the encrypted data as a byte array and not a string, that is where you are having a problem.
If you want to write multiple byte arrays to a single file, then since you are using Java you should probably wrap the file in a DataOutputStream and do something like this:
dos.writeInt(arr.length);
dos.write(arr);
dos.flush();
Do this for each byte array. Then, when you read the byte arrays back with a DataInputStream:
int length = dis.readInt();
byte[] bytes = new byte[length];
dis.readFully(bytes); // fills the array completely
This way the file can be structured like this:
[length of following array][array][length of second array][second array]
You will be able to put all of the byte arrays back to back, and since each array starts with the length of the array, you will know how much data needs to be put into each array.
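Putting that together, here is a self-contained sketch of the [length][array] layout using DataOutputStream/DataInputStream (the class and method names are mine; the records would be your already-encrypted byte arrays):

import java.io.*;
import java.util.ArrayList;
import java.util.List;

public class RecordFile {
    // Writes one length-prefixed record per array.
    static void writeRecords(File f, List<byte[]> records) throws IOException {
        try (DataOutputStream dos = new DataOutputStream(
                new BufferedOutputStream(new FileOutputStream(f)))) {
            for (byte[] rec : records) {
                dos.writeInt(rec.length); // 4-byte length prefix
                dos.write(rec);           // the (encrypted) payload
            }
        }
    }

    // Reads records back until end of file.
    static List<byte[]> readRecords(File f) throws IOException {
        List<byte[]> records = new ArrayList<>();
        try (DataInputStream dis = new DataInputStream(
                new BufferedInputStream(new FileInputStream(f)))) {
            while (true) {
                int length;
                try {
                    length = dis.readInt();
                } catch (EOFException eof) {
                    break; // clean end of file
                }
                byte[] rec = new byte[length];
                dis.readFully(rec); // loops internally until filled
                records.add(rec);
            }
        }
        return records;
    }
}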
See How to append to AES encrypted file for an AES+CBC Java example that allows opening an already encrypted file and appending more encrypted data to it, while not requiring any special handling when decrypting, since the result looks just as it would if the entire file had been encrypted in one pass.
I noticed that when I use readFully() on a file instead of read(byte[]), processing time is reduced greatly. However, it occurred to me that readFully may be a double-edged sword. If I accidentally try to read in a huge, multi-gigabyte file, it could choke?
Here is a function I am using to generate an SHA-256 checksum:
public static byte[] createChecksum(File log, String type) throws Exception {
    DataInputStream fis = new DataInputStream(new FileInputStream(log));
    Long len = log.length();
    byte[] buffer = new byte[len.intValue()];
    fis.readFully(buffer); // TODO: readFully may come at the risk of
                           // choking on a huge file.
    fis.close();
    MessageDigest complete = MessageDigest.getInstance(type);
    complete.update(buffer);
    return complete.digest();
}
If I were to instead use:
DataInputStream fis = new DataInputStream(new BufferedInputStream(new FileInputStream(log)));
Would that alleviate this risk? Or... is the best option (in situations where you can't guarantee the data size) always to control the number of bytes read and use a loop until all bytes are read?
(Come to think of it, since the MessageDigest API takes in the full byte array at once, I'm not sure how to compute a checksum without stuffing all the data in at once, but I suppose that is another question for another thread.)
You should just allocate a decently sized buffer (65536 bytes, perhaps) and loop, reading 64 KB at a time and calling complete.update() inside the loop to feed the digester. Be careful with the last block so that you only process the number of bytes actually read (probably fewer than 64 KB).
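A minimal sketch of that loop, reusing the signature from the question (the 64 KB buffer size is just a sensible default):

import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import java.security.MessageDigest;

public static byte[] createChecksum(File log, String type) throws Exception {
    MessageDigest complete = MessageDigest.getInstance(type);
    byte[] buffer = new byte[65536]; // fixed 64 KB buffer, regardless of file size
    try (InputStream in = new FileInputStream(log)) {
        int numRead;
        while ((numRead = in.read(buffer)) != -1) {
            complete.update(buffer, 0, numRead); // only the bytes actually read
        }
    }
    return complete.digest();
}

This also answers the aside in the question: MessageDigest.update() can be called repeatedly, so the full file never needs to be in memory at once.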
Reading the file will take as long as it takes, whether you use readFully() or not.
Whether you can actually allocate gigabyte-sized byte arrays is another question. There is no need to use readFully() here at all. It's for use in wire protocols where, say, the next 12 bytes are an identifier followed by another 60 bytes of address information, and you don't want to have to keep writing loops.
readFully() isn't going to choke if the file is multiple gigabytes, but allocating that byte buffer will. You'll get an out-of-memory exception before you ever get to the call to readFully().
You need to use the method of updating the hash with chunks of the file repeatedly, rather than updating it all at once with the entire file.
Is this:
ByteBuffer buf = ByteBuffer.allocate(1000);
...the only way to initialize a ByteBuffer?
What if I have no idea how many bytes I need to allocate..?
Edit: More details:
I'm converting one image file format to a TIFF file. The problem is that the source format can be any size, but I need to write the TIFF data in little-endian order. So I'm first reading everything I'm eventually going to write to the TIFF file into the ByteBuffer, so I can put it all in little-endian order, and then I'll write it to the outfile. I guess that since I know how long the IFDs and headers are, and I can probably figure out how many bytes are in each image plane, I can just use multiple ByteBuffers during this whole process.
The types of places that you would use a ByteBuffer are generally the types of places that you would otherwise use a byte array (which also has a fixed size). With synchronous I/O you often use byte arrays, with asynchronous I/O, ByteBuffers are used instead.
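As an aside for the TIFF case in the question: a ByteBuffer will do the little-endian conversion for you once you set its ByteOrder. A tiny sketch (the values are the standard little-endian TIFF preamble; the buffer size is just big enough to hold it):

import java.nio.ByteBuffer;
import java.nio.ByteOrder;

ByteBuffer header = ByteBuffer.allocate(8).order(ByteOrder.LITTLE_ENDIAN);
header.putShort((short) 0x4949); // "II" marks a little-endian TIFF
header.putShort((short) 42);     // TIFF magic number, stored as 0x2A 0x00
header.putInt(8);                // offset of the first IFD
byte[] bytes = header.array();   // ready to write to the output file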
If you need to read an unknown amount of data using a ByteBuffer, consider using a loop with your buffer and append the data to a ByteArrayOutputStream as you read it. When you are finished, call toByteArray() to get the final byte array.
Any time when you aren't absolutely sure of the size (or maximum size) of a given input, reading in a loop (possibly using a ByteArrayOutputStream, but otherwise just processing the data as a stream, as it is read) is the only way to handle it. Without some sort of loop, any remaining data will of course be lost.
For example:
final byte[] buf = new byte[4096];
int numRead;
// Use try-with-resources to auto-close streams.
try (
    final FileInputStream fis = new FileInputStream(...);
    final ByteArrayOutputStream baos = new ByteArrayOutputStream()
) {
    while ((numRead = fis.read(buf)) > 0) {
        baos.write(buf, 0, numRead);
    }
    final byte[] allBytes = baos.toByteArray();
    // Do something with the data.
} catch (final Exception e) {
    // Do something on failure...
}
If you instead wanted to write Java ints, or other things that aren't raw bytes, you can wrap your ByteArrayOutputStream in a DataOutputStream:
ByteArrayOutputStream baos = new ByteArrayOutputStream();
DataOutputStream dos = new DataOutputStream(baos);
while (thereAreMoreIntsFromSomewhere()) {
    int someInt = getIntFromSomewhere();
    dos.writeInt(someInt);
}
byte[] allBytes = baos.toByteArray();
Depends.
Library
Converting file formats tends to be a solved problem for most problem domains. For example:
Batik can transcode between various image formats (including TIFF).
Apache POI can convert between office spreadsheet formats.
Flexmark can generate HTML from Markdown.
The list is long. The first question should be, "What library can accomplish this task?" If performance is a consideration, your time is likely better spent optimising an existing package to meet your needs than writing yet another tool. (As a bonus, other people get to benefit from the centralised work.)
Known Quantities
Reading a file? Allocate file.size() bytes.
Copying a string? Allocate string.length() bytes.
Copying a TCP packet? Allocate 1500 bytes, for example.
Unknown Quantities
When the number of bytes is truly unknown, you can do a few things:
Make a guess.
Analyze example data sets you expect to buffer, and use the average length.
Example
Java's StringBuffer, unless otherwise instructed, uses an initial buffer size to hold 16 characters. Once the 16 characters are filled, a new, longer array is allocated, and then the original 16 characters copied. If the StringBuffer had an initial size of 1024 characters, then the reallocation would not happen as early or as often.
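A tiny sketch of the difference (the sizes are arbitrary):

StringBuffer small = new StringBuffer();     // default capacity: 16 chars
StringBuffer sized = new StringBuffer(1024); // presized: no early reallocation
for (int i = 0; i < 1000; i++) {
    sized.append('x'); // never triggers a grow-and-copy for this workload
}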
Optimization
Either way, this is probably a premature optimization. Typically you would allocate a set number of bytes when you want to reduce the number of internal memory reallocations that get executed.
It is unlikely that this will be the application's bottleneck.
The idea is that it's only a buffer, not the whole of the data: a temporary resting spot for the data as you read a chunk and process it (possibly writing it somewhere else). So allocate yourself a big enough "chunk" and it normally won't be a problem.
What problem are you anticipating?