Gets the uncompressed size of this GZIPInputStream?

Gets the uncompressed size of this GZIPInputStream? - java

I have a GZIPInputStream that I constructed from another ByteArrayInputStream. I want to know the original (uncompressed) length for the gzip data. Although I can read to the end of the GZIPInputStream, then count the number, it will cost much time and waste CPU. I would like to know the size before read it.
Is there a similiar method like ZipEntry.getSize() for GZIPInputStream:
public long getSize ()
Since: API Level 1
Gets the uncompressed size of this ZipEntry.

It is possible to determine the uncompressed size by reading the last four bytes of the gzipped file.
I found this solution here:
http://www.abeel.be/content/determine-uncompressed-size-gzip-file
Also from this link there is some example code (corrected to use long instead of int, to cope with sizes between 2GB and 4GB which would make an int wrap around):
RandomAccessFile raf = new RandomAccessFile(file, "r");
raf.seek(raf.length() - 4);
byte b4 = raf.read();
byte b3 = raf.read();
byte b2 = raf.read();
byte b1 = raf.read();
long val = ((long)b1 << 24) | ((long)b2 << 16) | ((long)b3 << 8) | (long)b4;
raf.close();
val is the length in bytes. Beware: you can not determine the correct uncompressed size, when the uncompressed file was greater than 4GB!

Based on #Alexander's answer:
RandomAccessFile raf = new RandomAccessFile(inputFilePath + ".gz", "r");
raf.seek(raf.length() - 4);
byte[] bytes = new byte[4];
raf.read(bytes);
fileSize = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN).getInt();
if (fileSize < 0)
fileSize += (1L << 32);
raf.close();

Is there a similiar method like ZipEntry.getSize() for
GZIPInputStream
No. It's not in the Javadoc => it doesn't exist.
What do you need the length for?

There is no reliable way to get the length other than decompressing the whole thing. See Uncompressed file size using zlib's gzip file access function .

If you can guess at the compression ratio (a reasonable expectation if the data is similar to other data you've already processed), then you can work out the size of arbitrarily large files (with some error). Again, this assumes a file containing a single gzip stream. The following assumes the first size greater than 90% of the estimated size (based on estimated ratio) is the true size:
estCompRatio = 6.1;
RandomAccessFile raf = new RandomAccessFile(inputFilePath + ".gz", "r");
compLength = raf.length();
byte[] bytes = new byte[4];
raf.read(bytes);
uncLength = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN).getInt();
raf.seek(compLength - 4);
uncLength = raf.readInt();
while(uncLength < (compLength * estCompRatio * 0.9)){
uncLength += (1L << 32);
}
[setting estCompRatio to 0 is equivalent to #Alexander's answer]

A more compact version of the calculation based on the 4 tail bytes (avoids using a byte buffer, calls Integer.reverseBytes to reverse the byte order of read bytes).
private static long getUncompressedSize(Path inputPath) throws IOException
{
long size = -1;
try (RandomAccessFile fp = new RandomAccessFile(inputPath.toFile(), "r")) {
fp.seek(fp.length() - Integer.BYTES);
int n = fp.readInt();
size = Integer.toUnsignedLong(Integer.reverseBytes(n));
}
return size;
}

Get the FileChannel from the underlying FileInputStream instead. It tells you both file size and current position of the compressed file. Example:
#Override
public void produce(final DataConsumer consumer, final boolean skipData) throws IOException {
try (FileInputStream fis = new FileInputStream(tarFile)) {
FileChannel channel = fis.getChannel();
final Eta<Long> eta = new Eta<>(channel.size());
try (InputStream is = tarFile.getName().toLowerCase().endsWith("gz")
? new GZIPInputStream(fis) : fis) {
try (TarArchiveInputStream tais = (TarArchiveInputStream) new ArchiveStreamFactory()
.createArchiveInputStream("tar", new BufferedInputStream(is))) {
TarArchiveEntry tae;
boolean done = false;
while (!done && (tae = tais.getNextTarEntry()) != null) {
if (tae.getName().startsWith("docs/") && tae.getName().endsWith(".html")) {
String data = null;
if (!skipData) {
data = new String(tais.readNBytes((int) tae.getSize()), StandardCharsets.UTF_8);
}
done = !consumer.consume(data);
}
String progress = eta.toStringPeriodical(channel.position());
if (progress != null) {
System.out.println(progress);
}
}
System.out.println("tar bytes read: " + tais.getBytesRead());
} catch (ArchiveException ex) {
throw new IOException(ex);
}
}
}
}

No, unfortunately if you wanted to get the uncompressed size, you would have to read the entire stream and increment a counter like you mention in your question. Why do you need to know the size? Could an estimation of the size work for your purposes?

Related

Trying to use BufferedInputStream and Base64 to Encode a large file in Java

I am new to the Java I/O so please help.
I am trying to process a large file(e.g. a pdf file of 50mb) using the apache commons library.
At first I try:
byte[] bytes = FileUtils.readFileToByteArray(file);
String encodeBase64String = Base64.encodeBase64String(bytes);
byte[] decoded = Base64.decodeBase64(encodeBase64String);
But knowing that the
FileUtils.readFileToByteArray in org.apache.commons.io will load the whole file into memory, I try to use BufferedInputStream to read the file piece by piece:
BufferedInputStream bis = new BufferedInputStream(inputStream);
StringBuilder pdfStringBuilder = new StringBuilder();
int byteArraySize = 10;
byte[] tempByteArray = new byte[byteArraySize];
while (bis.available() > 0) {
if (bis.available() < byteArraySize) { // reaching the end of file
tempByteArray = new byte[bis.available()];
}
int len = Math.min(bis.available(), byteArraySize);
read = bis.read(tempByteArray, 0, len);
if (read != -1) {
pdfStringBuilder.append(Base64.encodeBase64String(tempByteArray));
} else {
System.err.println("End of file reached.");
}
}
byte[] bytes = Base64.decodeBase64(pdfStringBuilder.toString());
However, the 2 decoded bytes array don't look quite the same... ... In fact, the only give 10 bytes, which is my temp array size... ...
Can anyone please help:
what am I doing it wrong to read the file piece by piece?
why is the decoded byte array only returns 10 bytes in the 2nd solution?
Thanks in advance:)

After some digging, it turns out that the byte array's size has to be multiple of 3 in order to avoid padding. After using a temp array size with multiple of 3, the program is able to go through.
I simply change
int byteArraySize = 10;
to be
int byteArraySize = 1024 * 3;

Java Reading large files into byte array chunk by chunk

So I've been trying to make a small program that inputs a file into a byte array, then it will turn that byte array into hex, then binary. It will then play with the binary values (I haven't thought of what to do when I get to this stage) and then save it as a custom file.
I studied a lot of internet code and I can turn a file into a byte array and into hex, but the problem is I can't turn huge files into byte arrays (out of memory).
This is the code that is not a complete failure
public void rundis(Path pp) {
byte bb[] = null;
try {
bb = Files.readAllBytes(pp); //Files.toByteArray(pathhold);
System.out.println("byte array made");
} catch (Exception e) {
e.printStackTrace();
}
if (bb.length != 0 || bb != null) {
System.out.println("byte array filled");
//send to method to turn into hex
} else {
System.out.println("byte array NOT filled");
}
}
I know how the process should go, but I don't know how to code that properly.
The process if you are interested:
Input file using File
Read the chunk by chunk of the file into a byte array. Ex. each byte array record hold 600 bytes
Send that chunk to be turned into a Hex value --> Integer.tohexstring
Send that hex value chunk to be made into a binary value --> Integer.toBinarystring
Mess around with the Binary value
Save to custom file line by line
Problem:: I don't know how to turn a huge file into a byte array chunk by chunk to be processed.
Any and all help will be appreciated, thank you for reading :)

To chunk your input use a FileInputStream:
Path pp = FileSystems.getDefault().getPath("logs", "access.log");
final int BUFFER_SIZE = 1024*1024; //this is actually bytes
FileInputStream fis = new FileInputStream(pp.toFile());
byte[] buffer = new byte[BUFFER_SIZE];
int read = 0;
while( ( read = fis.read( buffer ) ) > 0 ){
// call your other methodes here...
}
fis.close();

To stream a file, you need to step away from Files.readAllBytes(). It's a nice utility for small files, but as you noticed not so much for large files.
In pseudocode it would look something like this:
while there are more bytes available
read some bytes
process those bytes
(write the result back to a file, if needed)
In Java, you can use a FileInputStream to read a file byte by byte or chunk by chunk. Lets say we want to write back our processed bytes. First we open the files:
FileInputStream is = new FileInputStream(new File("input.txt"));
FileOutputStream os = new FileOutputStream(new File("output.txt"));
We need the FileOutputStream to write back our results - we don't want to just drop our precious processed data, right? Next we need a buffer which holds a chunk of bytes:
byte[] buf = new byte[4096];
How many bytes is up to you, I kinda like chunks of 4096 bytes. Then we need to actually read some bytes
int read = is.read(buf);
this will read up to buf.length bytes and store them in buf. It will return the total bytes read. Then we process the bytes:
//Assuming the processing function looks like this:
//byte[] process(byte[] data, int bytes);
byte[] ret = process(buf, read);
process() in above example is your processing method. It takes in a byte-array, the number of bytes it should process and returns the result as byte-array.
Last, we write the result back to a file:
os.write(ret);
We have to execute this in a loop until there are no bytes left in the file, so lets write a loop for it:
int read = 0;
while((read = is.read(buf)) > 0) {
byte[] ret = process(buf, read);
os.write(ret);
}
and finally close the streams
is.close();
os.close();
And thats it. We processed the file in 4096-byte chunks and wrote the result back to a file. It's up to you what to do with the result, you could also send it over TCP or even drop it if it's not needed, or even read from TCP instead of a file, the basic logic is the same.
This still needs some proper error-handling to work around missing files or wrong permissions but that's up to you to implement that.
A example implementation for the process method:
//returns the hex-representation of the bytes
public static byte[] process(byte[] bytes, int length) {
final char[] hexchars = "0123456789ABCDEF".toCharArray();
char[] ret = new char[length * 2];
for ( int i = 0; i < length; ++i) {
int b = bytes[i] & 0xFF;
ret[i * 2] = hexchars[b >>> 4];
ret[i * 2 + 1] = hexchars[b & 0x0F];
}
return ret;
}

java: read large binary file

I need to read out a given large file that contains 500000001 binaries. Afterwards I have to translate them into ASCII.
My Problem occurs while trying to store the binaries in a large array. I get the warning at the definition of the array ioBuf:
"The literal 16000000032 of type int is out of range."
I have no clue how to save these numbers to work with them! Has somebody an idea?
Here is my code:
public byte[] read(){
try{
BufferedInputStream in = new BufferedInputStream(new FileInputStream("data.dat"));
ByteArrayOutputStream bs = new ByteArrayOutputStream();
BufferedOutputStream out = new BufferedOutputStream(bs);
byte[] ioBuf = new byte[16000000032];
int bytesRead;
while ((bytesRead = in.read(ioBuf)) != -1){
out.write(ioBuf, 0, bytesRead);
}
out.close();
in.close();
return bs.toByteArray();
}

The maximum Index of an Array is Integer.MAX_VALUE and 16000000032 is greater than Integer.MAX_VALUE
Integer.MAX_VALUE = 2^31-1 = 2147483647
2147483647 < 16000000032
You could overcome this by checking if the Array is full and create another and continue reading.
But i'm not quite sure if your approach is the best way to perform this. byte[Integer_MAX_VALUE] is huge ;)
Maybe you can split the input file in smaller chunks process them.
EDIT: This is how you could read a single int of your file. You can resize the buffer's size to the amount of data you want to read. But you tried to read the whole file at once.
//Allocate buffer with 4byte = 32bit = Integer.SIZE
byte[] ioBuf = new byte[4];
int bytesRead;
while ((bytesRead = in.read(ioBuf)) != -1){
//if bytesRead == 4 you read 1 int
//do your stuff
}

If you need to declare a large constant, append an 'L' to it which indicates to the compiler that is a long constant. However, as mentioned in another answer you can't declare arrays that large.
I suspect the purpose of the exercise is to learn how to use the java.nio.Buffer family of classes.

I made some progress by starting from scratch! But I still have a problem.
My idea is to read up the first 32 bytes, convert them to a int number. Then the next 32 bytes etc. Unfortunately I just get the first and don't know how to proceed.
I discovered following method for converting these numbers to int:
public static int byteArrayToInt(byte[] b){
final ByteBuffer bb = ByteBuffer.wrap(b);
bb.order(ByteOrder.LITTLE_ENDIAN);
return bb.getInt();
}
so now I have:
BufferedInputStream in=null;
byte[] buf = new byte[32];
try {
in = new BufferedInputStream(new FileInputStream("ndata.dat"));
in.read(buf);
System.out.println(byteArrayToInt(buf));
in.close();
} catch (IOException e) {
System.out.println("error while reading ndata.dat file");
}

Divide the video to bytes

i want to convert the video to bytes it gives me result but i think the result is not correct because i test it for different videos and the gives me the same result so
can any one help please to do how to convert video to byte
String filename = "D:/try.avi";
byte[] myByteArray = filename.getBytes();
for(int i = 0; i<myByteArray.length;i ++)
{
System.out.println(myByteArray[i]);
}
Any help Please?

String filename = "D:/try.avi";
byte[] myByteArray = filename.getBytes();
That is converting the file name to bytes, not the file content.
As for reading the content of the file, see the Basic I/O lesson of the Java Tutorial.

Videos in same container formats start with same bytes. The codec used determines the actual video files.
I suggest you read more about container file formats and codecs first if you plan developing video applications.
But you have a different problem. As Andrew Thompson correctly pointed out, you are getting the bytes of the filename string.
The correct approach would be:
private static File fl=new File("D:\video.avi");
byte[] myByteArray = getBytesFromFile(fl);
Please also bear in mind that terminals usually have fixed buffer size (on Windows, it's several lines), so outputting a big chunk of data will display only last several lines of it.
Edit: Here's an implementation of getBytesFromFile; a java expert may offer more standard approach.
public static byte[] getBytesFromFile(File file) throws IOException {
InputStream is = openFile(file.getPath());
// Get the size of the file
long length = file.length();
if (length > Integer.MAX_VALUE) {
// File is too large
Assert.assertExp(false);
logger.warn(file.getPath()+" is too big");
}
// Create the byte array to hold the data
byte[] bytes = new byte[(int)length];
// debug - init array
for (int i = 0; i < length; i++){
bytes[i] = 0x0;
}
// Read in the bytes
int offset = 0;
int numRead = 0;
while (offset < bytes.length && (numRead=is.read(bytes, offset, bytes.length-offset)) >= 0) {
offset += numRead;
}
// Ensure all the bytes have been read in
if (offset < bytes.length) {
throw new IOException("Could not completely read file "+file.getName());
}
// Close the input stream and return bytes
is.close();
return bytes;
}

If you want to read the contents of the video file then use File.
String filename = "D:/try.avi";
File file=new File(filename);
byte myByteArray[]=new byte[(int)file.length()];
RandomAccessFile raf=new RandomAccessFile(file,"rw");
raf.read(myByteArray);

Reading a binary input stream into a single byte array in Java

The documentation says that one should not use available() method to determine the size of an InputStream. How can I read the whole content of an InputStream into a byte array?
InputStream in; //assuming already present
byte[] data = new byte[in.available()];
in.read(data);//now data is filled with the whole content of the InputStream
I could read multiple times into a buffer of a fixed size, but then, I will have to combine the data I read into a single byte array, which is a problem for me.

The simplest approach IMO is to use Guava and its ByteStreams class:
byte[] bytes = ByteStreams.toByteArray(in);
Or for a file:
byte[] bytes = Files.toByteArray(file);
Alternatively (if you didn't want to use Guava), you could create a ByteArrayOutputStream, and repeatedly read into a byte array and write into the ByteArrayOutputStream (letting that handle resizing), then call ByteArrayOutputStream.toByteArray().
Note that this approach works whether you can tell the length of your input or not - assuming you have enough memory, of course.

Please keep in mind that the answers here assume that the length of the file is less than or equal to Integer.MAX_VALUE(2147483647).
If you are reading in from a file, you can do something like this:
File file = new File("myFile");
byte[] fileData = new byte[(int) file.length()];
DataInputStream dis = new DataInputStream(new FileInputStream(file));
dis.readFully(fileData);
dis.close();
UPDATE (May 31, 2014):
Java 7 adds some new features in the java.nio.file package that can be used to make this example a few lines shorter. See the readAllBytes() method in the java.nio.file.Files class. Here is a short example:
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
// ...
Path p = FileSystems.getDefault().getPath("", "myFile");
byte [] fileData = Files.readAllBytes(p);
Android has support for this starting in Api level 26 (8.0.0, Oreo).

You can use Apache commons-io for this task:
Refer to this method:
public static byte[] readFileToByteArray(File file) throws IOException
Update:
Java 7 way:
byte[] bytes = Files.readAllBytes(Paths.get(filename));
and if it is a text file and you want to convert it to String (change encoding as needed):
StandardCharsets.UTF_8.decode(ByteBuffer.wrap(bytes)).toString()

You can read it by chunks (byte buffer[] = new byte[2048]) and write the chunks to a ByteArrayOutputStream. From the ByteArrayOutputStream you can retrieve the contents as a byte[], without needing to determine its size beforehand.

I believe buffer length needs to be specified, as memory is finite and you may run out of it
Example:
InputStream in = new FileInputStream(strFileName);
long length = fileFileName.length();
if (length > Integer.MAX_VALUE) {
throw new IOException("File is too large!");
}
byte[] bytes = new byte[(int) length];
int offset = 0;
int numRead = 0;
while (offset < bytes.length && (numRead = in.read(bytes, offset, bytes.length - offset)) >= 0) {
offset += numRead;
}
if (offset < bytes.length) {
throw new IOException("Could not completely read file " + fileFileName.getName());
}
in.close();

Max value for array index is Integer.MAX_INT - it's around 2Gb (2^31 / 2 147 483 647).
Your input stream can be bigger than 2Gb, so you have to process data in chunks, sorry.
InputStream is;
final byte[] buffer = new byte[512 * 1024 * 1024]; // 512Mb
while(true) {
final int read = is.read(buffer);
if ( read < 0 ) {
break;
}
// do processing
}

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

Gets the uncompressed size of this GZIPInputStream? - java

Is there a similiar method like ZipEntry.getSize() for GZIPInputStream No. It's not in the Javadoc => it doesn't exist. What do you need the length for?

There is no reliable way to get the length other than decompressing the whole thing. See Uncompressed file size using zlib's gzip file access function .

No, unfortunately if you wanted to get the uncompressed size, you would have to read the entire stream and increment a counter like you mention in your question. Why do you need to know the size? Could an estimation of the size work for your purposes?

Related

Trying to use BufferedInputStream and Base64 to Encode a large file in Java

Java Reading large files into byte array chunk by chunk

java: read large binary file

Divide the video to bytes

Reading a binary input stream into a single byte array in Java

Categories

Resources