Reading large number of bytes from GZIPInputStream - java

I am reading a gzipped file through a GZIPInputStream. I want to read a large amount of data at once, but no matter how many bytes I ask the GZIPInputStream to read, it always reads far fewer bytes. For example,
val bArray = new Array[Byte](81920)
val fis = new FileInputStream(new File(inputFileName))
val gis = new GZIPInputStream(fis)
val bytesRead = gis.read(bArray)
The bytes read are always somewhere around 1800, while they should be nearly equal to the size of bArray, which is 81920 in this case. Why is it like this? Is there a way to solve this problem and actually read more bytes at once?

I would try using akka-streams if you have a large amount of data.
import java.io.{File, FileInputStream}
import java.util.zip.GZIPInputStream
import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import scala.io.{BufferedSource, Source}

implicit val system = ActorSystem()
implicit val ec = system.dispatcher
implicit val materializer = ActorMaterializer()
val fis = new FileInputStream(new File(""))
val gis = new GZIPInputStream(fis)
val bfs: BufferedSource = Source.fromInputStream(gis)
bfs exposes the Flow api for stream processing.
You can also get a stream from that:
val ss: java.util.stream.Stream[String] = bfs.bufferedReader().lines()

read may return fewer bytes than you ask for, so in general you always have to loop, reading until you have as many as you want.
In other words, giving GZIPInputStream a big buffer doesn't mean it will be filled on a given request.
import java.util.zip.GZIPInputStream
import java.io.FileInputStream
import java.io.File
import java.io.InputStream
import java.io.FilterInputStream
object Unzipped extends App {
val inputFileName = "/tmp/sss.gz"
val bArray = new Array[Byte](80 * 1024)
val fis = new FileInputStream(new File(inputFileName))
val stingy = new StingyInputStream(fis)
val gis = new GZIPInputStream(stingy, 80 * 1024)
val bytesRead = gis.read(bArray, 0, bArray.length)
println(bytesRead)
}
class StingyInputStream(is: InputStream) extends FilterInputStream(is) {
override def read(b: Array[Byte], off: Int, len: Int) = {
val n = len.min(1024)
super.read(b, off, n)
}
}
So loop to drain the stream rather than issuing a single read:
import reflect.io.Streamable.Bytes
val sb = new Bytes {
override val length = 80 * 1024L
override val inputStream = gis
}
val res = sb.toByteArray()
println(res.length) // your explicit length
I'm not saying that's the API to use, it's just a demo. I'm too lazy to write a loop.
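For reference, the drain loop might look like this in plain Java (a minimal sketch reusing the file name and buffer size from the example above; the same java.util.zip classes are what Scala calls under the hood):
import java.io.FileInputStream;
import java.io.IOException;
import java.util.zip.GZIPInputStream;

public class Drain {
    public static void main(String[] args) throws IOException {
        byte[] buf = new byte[80 * 1024];
        try (GZIPInputStream gis = new GZIPInputStream(new FileInputStream("/tmp/sss.gz"))) {
            int filled = 0;
            // keep reading until the buffer is full or the stream ends
            while (filled < buf.length) {
                int n = gis.read(buf, filled, buf.length - filled);
                if (n < 0) break; // end of stream
                filled += n;
            }
            System.out.println(filled);
        }
    }
}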

OK, I found the solution. There is a version of the GZIPInputStream constructor that also takes a buffer size.
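For reference, that constructor takes the internal buffer size as its second argument; a minimal sketch (the 64 KB value is just an illustration, and a single read() can still return fewer bytes than requested):
import java.io.FileInputStream;
import java.util.zip.GZIPInputStream;

// the second argument sets the size of the internal input buffer
GZIPInputStream gis = new GZIPInputStream(new FileInputStream("/tmp/sss.gz"), 64 * 1024);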

Related

Zip byte[] in Kotlin. Java code to Kotlin

I'm trying to use this Java code, converting it to Kotlin in Android Studio, but I can't find a Kotlin equivalent for setSize(..) and .length. Could anyone help me?
public static byte[] zipBytes(String filename, byte[] input) throws IOException {
ByteArrayOutputStream baos = new ByteArrayOutputStream();
ZipOutputStream zos = new ZipOutputStream(baos);
ZipEntry entry = new ZipEntry(filename);
entry.setSize(input.length);
zos.putNextEntry(entry);
zos.write(input);
zos.closeEntry();
zos.close();
return baos.toByteArray();
}
An array in Kotlin has a size property instead of Java's array length, and size is an Int, but ZipEntry.setSize(long size) accepts only a long. So you can do something like this:
entry.setSize(input.size.toLong())
Or in a more idiomatic Kotlin way:
entry.size = input.size.toLong()
When you write a byte array in Kotlin like this:
val byteArray = ByteArray(1024)
var length = byteArray.size
the documentation says:
An array of bytes. When targeting the JVM, instances of this class are represented as byte[].
@constructor Creates a new array of the specified [size], with all elements initialized to zero.
To verify it, the generated bytecode is:
byte[] byteArray = new byte[1024];
int test = byteArray.length;
Therefore, in your case you can write:
entry.size = byteArray.size
But the type of size is Int while entry.size needs a Long value, so just add .toLong() to size to fix this issue.
Try to use this code:
Imports:
import java.io.ByteArrayOutputStream
import java.io.IOException
import java.util.zip.ZipEntry
import java.util.zip.ZipOutputStream
And your code in Kotlin:
@Throws(IOException::class)
fun zipBytes(filename: String?, input: ByteArray): ByteArray? {
val baos = ByteArrayOutputStream()
val zos = ZipOutputStream(baos)
val entry = ZipEntry(filename)
entry.size = input.size.toLong()
zos.putNextEntry(entry)
zos.write(input)
zos.closeEntry()
zos.close()
return baos.toByteArray()
}

How to unzip file from InputStream

I'm trying to get a zip file from the server.
I'm using HttpURLConnection to get an InputStream, and this is what I have:
myInputStream.toString().getBytes().toString() is equal to [B#4.....
byte[] bytes = Base64.decode(myInputStream.toString(), Base64.DEFAULT);
String string = new String(bytes, "UTF-8");
string == �&ܢ��z�m����y....
I really tried to unzip this file but nothing works. There are so many questions: should I use GZIPInputStream or ZipInputStream? Do I have to save this stream as a file, or can I work on the InputStream directly?
Please help, my boss is getting impatient:O
I have no idea what is in this file; I have to find out :)
GZIPInputStream and ZipInputStream handle two different formats. https://en.wikipedia.org/wiki/Gzip
It is not a good idea to retrieve a string directly from the stream. From an InputStream, you can create a File and write data into it using a FileOutputStream.
Decoding in Base64 is something else. If your stream has already been decoded upstream, it's OK; otherwise you have to wrap your stream in another input stream that decodes the Base64 format.
The best practice is to use a buffer to avoid memory overflow.
Here is some Kotlin code that decompresses the zipped InputStream into a file (simpler than Java because managing byte[] is tedious):
val fileBinaryDecompress = File(..path..)
val outputStream = FileOutputStream(fileBinaryDecompress)
readFromStream(ZipInputStream(myInputStream), BUFFER_SIZE_BYTES,
object : ReadBytes {
override fun read(buffer: ByteArray) {
outputStream.write(buffer)
}
})
outputStream.close()
interface ReadBytes {
/**
* Called after each buffer fill
* @param buffer filled
*/
@Throws(IOException::class)
fun read(buffer: ByteArray)
}
@Throws(IOException::class)
fun readFromStream(inputStream: InputStream, bufferSize: Int, readBytes: ReadBytes) {
val buffer = ByteArray(bufferSize)
var read = 0
while (read != -1) {
read = inputStream.read(buffer, 0, buffer.size)
if (read != -1) {
val optimizedBuffer: ByteArray = if (buffer.size == read) {
buffer
} else {
buffer.copyOf(read)
}
readBytes.read(optimizedBuffer)
}
}
}
If you want to get the file from the server without decompressing it, remove the ZipInputStream() decorator.
Usually, there is no significant difference between GZIPInputStream and ZipInputStream here, so if it works at all, both should work.
Next, you need to identify whether the zipped stream was Base64 encoded, or whether some Base64 encoded content was put into a zipped stream - from what you put in your question, it seems to be the latter.
So you should try
ZipInputStream zis = new ZipInputStream( myInputStream );
ZipEntry ze = zis.getNextEntry();
// after getNextEntry(), read the entry's data directly from zis
// (it is ZipFile, not ZipInputStream, that has a getInputStream(entry) method)
and proceed from there ...
Basically, wrapping the InputStream in a GZIPInputStream should let you read the actual content.
Also, for simplicity, using the IOUtils class from Apache Commons IO makes your life easier.
this works for me:
InputStream is; // initialize your InputStream
is = new GZIPInputStream(is);
byte[] bytes = IOUtils.toByteArray(is);
String s = new String(bytes);
System.out.println(s);

How to use ByteStream to read 1Mb of a file into a string

What I have now is using FileInputStream
int length = 1024*1024;
FileInputStream fs = new FileInputStream(new File("foo"));
fs.skip(offset);
byte[] buf = new byte[length];
int bufferSize = fs.read(buf, 0, length);
String s = new String(buf, 0, bufferSize);
I'm wondering how I can achieve the same result using ByteStreams in the Guava library.
Thanks a lot!
Here's how you could do it with Guava:
byte[] bytes = Files.asByteSource(new File("foo"))
.slice(offset, length)
.read();
String s = new String(bytes, Charsets.US_ASCII);
There are a couple of problems with your code (though it may work fine for files, it won't necessarily for any type of stream):
fs.skip(offset);
This doesn't necessarily skip all offset bytes. You have to either check the number of bytes skipped in the return value and loop until you've skipped the full amount, or use something that does that for you, such as ByteStreams.skipFully.
int bufferSize = fs.read(buf, 0, length);
Again, this won't necessarily read all length bytes, and the number of bytes it does read can be an arbitrary amount--you can't rely on it in general.
String s = new String(buf, 0, bufferSize);
This implicitly uses the system default Charset, which usually isn't a good idea--and when you do want it, it's best to make it explicit with Charset.defaultCharset().
Also note that in general, a certain number of bytes may not translate to a legal sequence of characters depending on the Charset being used (i.e. if it's ASCII you're fine, if it's Unicode, not so much).
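If you have to work against a plain InputStream rather than a ByteSource, a sketch of the corrected version using the Guava helpers mentioned above might look like this (offset and length as in the question):
import com.google.common.io.ByteStreams;
import java.io.File;
import java.io.FileInputStream;
import java.nio.charset.StandardCharsets;

FileInputStream fs = new FileInputStream(new File("foo"));
ByteStreams.skipFully(fs, offset);                  // throws EOFException if the stream is too short
byte[] buf = new byte[length];
int filled = ByteStreams.read(fs, buf, 0, length);  // loops internally; less than length only at end of stream
String s = new String(buf, 0, filled, StandardCharsets.US_ASCII);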
Why try to use Guava when it's not necessary?
In this case, it looks like you're looking exactly for a RandomAccessFile.
File file = new File("foo");
long offset = ... ;
try (RandomAccessFile raf = new RandomAccessFile(file, "r")) {
byte[] buffer = new byte[1024 * 1024];
raf.seek(offset);
raf.readFully(buffer);
return new String(buffer, Charset.defaultCharset());
}
I'm not aware of a more elegant solution:
public static void main(String[] args) throws IOException {
final int offset = 20;
StringBuilder to = new StringBuilder();
CharStreams.copy(CharStreams.newReaderSupplier(new InputSupplier<InputStream>() {
@Override
public InputStream getInput() throws IOException {
FileInputStream fs = new FileInputStream(new File("pom.xml"));
ByteStreams.skipFully(fs, offset);
return fs;
}
}, Charset.defaultCharset()), to);
System.out.println(to);
}
The only advantage is that you can save some GC time when the content is really big, by avoiding the intermediate byte[] and String conversion.

Reading a binary input stream into a single byte array in Java

The documentation says that one should not use available() method to determine the size of an InputStream. How can I read the whole content of an InputStream into a byte array?
InputStream in; //assuming already present
byte[] data = new byte[in.available()];
in.read(data);//now data is filled with the whole content of the InputStream
I could read multiple times into a buffer of a fixed size, but then I would have to combine the data I read into a single byte array, which is a problem for me.
The simplest approach IMO is to use Guava and its ByteStreams class:
byte[] bytes = ByteStreams.toByteArray(in);
Or for a file:
byte[] bytes = Files.toByteArray(file);
Alternatively (if you didn't want to use Guava), you could create a ByteArrayOutputStream, and repeatedly read into a byte array and write into the ByteArrayOutputStream (letting that handle resizing), then call ByteArrayOutputStream.toByteArray().
Note that this approach works whether you can tell the length of your input or not - assuming you have enough memory, of course.
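A minimal sketch of that ByteArrayOutputStream approach, in case Guava is not available (the 8 KB buffer size is arbitrary):
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

public static byte[] toByteArray(InputStream in) throws IOException {
    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    byte[] buffer = new byte[8192];
    int read;
    // copy in chunks; ByteArrayOutputStream handles the resizing
    while ((read = in.read(buffer)) != -1) {
        baos.write(buffer, 0, read);
    }
    return baos.toByteArray();
}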
Please keep in mind that the answers here assume that the length of the file is less than or equal to Integer.MAX_VALUE (2,147,483,647).
If you are reading in from a file, you can do something like this:
File file = new File("myFile");
byte[] fileData = new byte[(int) file.length()];
DataInputStream dis = new DataInputStream(new FileInputStream(file));
dis.readFully(fileData);
dis.close();
UPDATE (May 31, 2014):
Java 7 adds some new features in the java.nio.file package that can be used to make this example a few lines shorter. See the readAllBytes() method in the java.nio.file.Files class. Here is a short example:
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
// ...
Path p = FileSystems.getDefault().getPath("", "myFile");
byte [] fileData = Files.readAllBytes(p);
Android has support for this starting in API level 26 (8.0.0, Oreo).
You can use Apache commons-io for this task:
Refer to this method (in org.apache.commons.io.FileUtils):
public static byte[] readFileToByteArray(File file) throws IOException
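For reference, a minimal usage sketch (assuming commons-io is on the classpath):
import java.io.File;
import org.apache.commons.io.FileUtils;

byte[] fileData = FileUtils.readFileToByteArray(new File("myFile"));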
Update:
Java 7 way:
byte[] bytes = Files.readAllBytes(Paths.get(filename));
and if it is a text file and you want to convert it to String (change encoding as needed):
StandardCharsets.UTF_8.decode(ByteBuffer.wrap(bytes)).toString()
You can read it by chunks (byte buffer[] = new byte[2048]) and write the chunks to a ByteArrayOutputStream. From the ByteArrayOutputStream you can retrieve the contents as a byte[], without needing to determine its size beforehand.
I believe buffer length needs to be specified, as memory is finite and you may run out of it
Example:
File file = new File(strFileName);
InputStream in = new FileInputStream(file);
long length = file.length();
if (length > Integer.MAX_VALUE) {
throw new IOException("File is too large!");
}
byte[] bytes = new byte[(int) length];
int offset = 0;
int numRead = 0;
while (offset < bytes.length && (numRead = in.read(bytes, offset, bytes.length - offset)) >= 0) {
offset += numRead;
}
if (offset < bytes.length) {
throw new IOException("Could not completely read file " + file.getName());
}
in.close();
The max value for an array index is Integer.MAX_VALUE, which is 2^31 - 1 = 2,147,483,647 (about 2 GB).
Your input stream can be bigger than 2 GB, so you have to process the data in chunks, sorry.
InputStream is;
final byte[] buffer = new byte[512 * 1024 * 1024]; // 512 MB
while(true) {
final int read = is.read(buffer);
if ( read < 0 ) {
break;
}
// do processing
}

Gets the uncompressed size of this GZIPInputStream?

I have a GZIPInputStream that I constructed from another ByteArrayInputStream. I want to know the original (uncompressed) length of the gzip data. Although I can read to the end of the GZIPInputStream and count the bytes, it will cost much time and waste CPU. I would like to know the size before reading it.
Is there a similar method like ZipEntry.getSize() for GZIPInputStream:
public long getSize ()
Since: API Level 1
Gets the uncompressed size of this ZipEntry.
It is possible to determine the uncompressed size by reading the last four bytes of the gzipped file.
I found this solution here:
http://www.abeel.be/content/determine-uncompressed-size-gzip-file
Also from this link there is some example code (corrected to use long instead of int, to cope with sizes between 2 GB and 4 GB, which would make an int wrap around):
RandomAccessFile raf = new RandomAccessFile(file, "r");
raf.seek(raf.length() - 4);
int b4 = raf.read();
int b3 = raf.read();
int b2 = raf.read();
int b1 = raf.read();
long val = ((long) b1 << 24) | ((long) b2 << 16) | ((long) b3 << 8) | (long) b4;
raf.close();
val is the length in bytes. Beware: you cannot determine the correct uncompressed size when the uncompressed file was greater than 4 GB!
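For a concrete example of that limitation: the gzip trailer stores ISIZE, the uncompressed size modulo 2^32, so a 5 GiB (5,368,709,120-byte) file is reported as 5,368,709,120 mod 4,294,967,296 = 1,073,741,824 bytes, i.e. 1 GiB.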
Based on @Alexander's answer:
RandomAccessFile raf = new RandomAccessFile(inputFilePath + ".gz", "r");
raf.seek(raf.length() - 4);
byte[] bytes = new byte[4];
raf.readFully(bytes);
long fileSize = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN).getInt();
if (fileSize < 0)
fileSize += (1L << 32);
raf.close();
Is there a similar method like ZipEntry.getSize() for
GZIPInputStream
No. It's not in the Javadoc => it doesn't exist.
What do you need the length for?
There is no reliable way to get the length other than decompressing the whole thing. See Uncompressed file size using zlib's gzip file access function.
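If you do end up having to decompress to get an exact count, a minimal counting sketch might look like this (it discards the data and only tallies the decompressed bytes):
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;

public static long uncompressedLength(InputStream compressed) throws IOException {
    try (GZIPInputStream gis = new GZIPInputStream(compressed)) {
        byte[] buffer = new byte[8192];
        long total = 0;
        int read;
        while ((read = gis.read(buffer)) != -1) {
            total += read; // count decompressed bytes, discard the contents
        }
        return total;
    }
}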
If you can guess at the compression ratio (a reasonable expectation if the data is similar to other data you've already processed), then you can work out the size of arbitrarily large files (with some error). Again, this assumes a file containing a single gzip stream. The following assumes the first size greater than 90% of the estimated size (based on estimated ratio) is the true size:
double estCompRatio = 6.1;
RandomAccessFile raf = new RandomAccessFile(inputFilePath + ".gz", "r");
long compLength = raf.length();
raf.seek(compLength - 4);
byte[] bytes = new byte[4];
raf.readFully(bytes);
// ISIZE: uncompressed size modulo 2^32, stored little-endian in the last 4 bytes
long uncLength = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN).getInt() & 0xFFFFFFFFL;
while (uncLength < (compLength * estCompRatio * 0.9)) {
uncLength += (1L << 32);
}
raf.close();
[setting estCompRatio to 0 is equivalent to @Alexander's answer]
A more compact version of the calculation based on the 4 tail bytes (it avoids a byte buffer and calls Integer.reverseBytes to swap the byte order of the value read):
private static long getUncompressedSize(Path inputPath) throws IOException
{
long size = -1;
try (RandomAccessFile fp = new RandomAccessFile(inputPath.toFile(), "r")) {
fp.seek(fp.length() - Integer.BYTES);
int n = fp.readInt();
size = Integer.toUnsignedLong(Integer.reverseBytes(n));
}
return size;
}
Get the FileChannel from the underlying FileInputStream instead. It tells you both file size and current position of the compressed file. Example:
@Override
public void produce(final DataConsumer consumer, final boolean skipData) throws IOException {
try (FileInputStream fis = new FileInputStream(tarFile)) {
FileChannel channel = fis.getChannel();
final Eta<Long> eta = new Eta<>(channel.size());
try (InputStream is = tarFile.getName().toLowerCase().endsWith("gz")
? new GZIPInputStream(fis) : fis) {
try (TarArchiveInputStream tais = (TarArchiveInputStream) new ArchiveStreamFactory()
.createArchiveInputStream("tar", new BufferedInputStream(is))) {
TarArchiveEntry tae;
boolean done = false;
while (!done && (tae = tais.getNextTarEntry()) != null) {
if (tae.getName().startsWith("docs/") && tae.getName().endsWith(".html")) {
String data = null;
if (!skipData) {
data = new String(tais.readNBytes((int) tae.getSize()), StandardCharsets.UTF_8);
}
done = !consumer.consume(data);
}
String progress = eta.toStringPeriodical(channel.position());
if (progress != null) {
System.out.println(progress);
}
}
System.out.println("tar bytes read: " + tais.getBytesRead());
} catch (ArchiveException ex) {
throw new IOException(ex);
}
}
}
}
No, unfortunately, if you want to get the uncompressed size, you have to read the entire stream and increment a counter, as you mention in your question. Why do you need to know the size? Could an estimate of the size work for your purposes?
