Extract tar.gz file in memory in Java

I'm using the Apache Commons Compress library to read a .tar.gz file, something like this:
final TarArchiveInputStream tarIn = initializeTarArchiveStream(this.archiveFile);
try {
    TarArchiveEntry tarEntry = tarIn.getNextTarEntry();
    while (tarEntry != null) {
        byte[] btoRead = new byte[1024];
        BufferedOutputStream bout = new BufferedOutputStream(new FileOutputStream(destPath)); //<- I don't want this!
        int len = 0;
        while ((len = tarIn.read(btoRead)) != -1) {
            bout.write(btoRead, 0, len);
        }
        bout.close();
        tarEntry = tarIn.getNextTarEntry();
    }
    tarIn.close();
}
catch (IOException e) {
    e.printStackTrace();
}
Is it possible not to extract this into a separate file, and read it in memory somehow? Maybe into a giant String or something?

You could replace the file stream with a ByteArrayOutputStream.
i.e. replace this:
BufferedOutputStream bout = new BufferedOutputStream(new FileOutputStream(destPath)); //<- I don't want this!
with this:
ByteArrayOutputStream bout = new ByteArrayOutputStream();
and then after closing bout, use bout.toByteArray() to get the bytes.
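The replacement described above can be sketched with the standard library alone. This is a minimal sketch, not the asker's code: the class name InMemoryCopy and the helper readAll are made up for illustration, and a ByteArrayInputStream stands in for tarIn. Since the loop in the question relies on tarIn.read() returning -1 at the end of the current entry, calling such a helper once per entry should collect that entry's bytes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

final class InMemoryCopy {
    // Copies everything the stream still delivers into a byte array
    // instead of writing it to a file.
    static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream bout = new ByteArrayOutputStream();
        byte[] buffer = new byte[1024];
        int len;
        while ((len = in.read(buffer)) != -1) {
            bout.write(buffer, 0, len);
        }
        return bout.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for tarIn positioned on an entry (assumption for the demo).
        InputStream in = new ByteArrayInputStream("hello".getBytes(StandardCharsets.UTF_8));
        byte[] data = readAll(in);
        System.out.println(data.length + " bytes held in memory");
    }
}
```

Closing a ByteArrayOutputStream has no effect, so toByteArray() can be called either before or after close().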

Is it possible not to extract this into a separate file, and read it in memory somehow? Maybe into a giant String or something?
Yes, sure.
Just replace the code in the inner loop that opens files and writes to them with code that writes to a ByteArrayOutputStream ... or a series of such streams.
The natural representation of the data that you read from the TAR (like that) will be bytes / byte arrays. If the bytes are properly encoded characters, and you know the correct encoding, then you can convert them to strings. Otherwise, it is better to leave the data as bytes. (If you attempt to convert non-text data to strings, or if you convert using the wrong charset/encoding, you are liable to mangle it ... irreversibly.)
Obviously, you are going to need to think through some of these issues yourself, but the basic idea should work ... provided you have enough heap space.
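The warning about mangling can be illustrated with a small sketch. It assumes UTF-8 as the encoding (the answer does not name one), and the class and method names are hypothetical. Text that really is UTF-8 survives a round trip through String, while arbitrary binary data does not, because invalid sequences are replaced during decoding.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class BytesToText {
    // Decodes bytes that are assumed to be UTF-8 text.
    static String decodeUtf8(byte[] data) {
        return new String(data, StandardCharsets.UTF_8);
    }

    // True if decoding to a String and encoding back yields the same bytes.
    static boolean survivesRoundTrip(byte[] data) {
        byte[] back = decodeUtf8(data).getBytes(StandardCharsets.UTF_8);
        return Arrays.equals(data, back);
    }

    public static void main(String[] args) {
        byte[] text = "caf\u00e9".getBytes(StandardCharsets.UTF_8);
        byte[] binary = {(byte) 0xFF, (byte) 0xFE, 0x00, 0x01}; // not valid UTF-8
        System.out.println("text survives:   " + survivesRoundTrip(text));
        System.out.println("binary survives: " + survivesRoundTrip(binary));
    }
}
```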

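Copy the bytes that were read into btoRead to a String, like
String s = new String(btoRead, 0, len, StandardCharsets.UTF_8);
and keep appending to the string until the end of the file is reached. (String.valueOf() on a byte array would not decode its contents, and this only works if the data is text in the assumed encoding, here UTF-8.)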

Related

Zip and Unzip a large file without loading the entire file in memory in apache Camel

We are using Apache Camel for compressing and decompressing our files.
We use the standard .marshal().gzip() and .unmarshal().gzip() APIs.
Our problem is that when we get really large files, say 800 MB to more than 1 GB, our application runs out of memory, since the entire file is loaded into memory for compression and decompression.
Are there any Camel APIs or Java libraries that can zip/unzip a file without loading the entire file into memory?
There is a similar unanswered question here

Explanation
Use a different approach: stream the file.
That is, don't load it into memory completely; read it chunk by chunk and write each chunk out as soon as it has been read.
Get an InputStream for the file and wrap a GZIPInputStream around it. Read a chunk, write it to an OutputStream, and repeat.
Do the opposite if you want to compress: wrap the OutputStream in a GZIPOutputStream.
Code
The example uses Apache Commons Compress but the logic of the code remains the same for all libraries.
Unpacking a gz archive:
Path inputPath = Paths.get("archive.tar.gz");
Path outputPath = Paths.get("archive.tar");
final int bufferSize = 8192; // assumed chunk size; pick what suits your case
// Declaring all streams in the try header closes each of them automatically
try (InputStream fin = Files.newInputStream(inputPath);
     GzipCompressorInputStream in = new GzipCompressorInputStream(
             new BufferedInputStream(fin));
     OutputStream out = Files.newOutputStream(outputPath)) {
    // Copy one buffer-sized chunk at a time
    final byte[] buffer = new byte[bufferSize];
    int n = 0;
    while (-1 != (n = in.read(buffer))) {
        out.write(buffer, 0, n);
    }
}
Packing as gz archive:
Path inputPath = Paths.get("archive.tar");
Path outputPath = Paths.get("archive.tar.gz");
final int bufferSize = 8192; // assumed chunk size; pick what suits your case
// The compressor stream must be closed so that the gzip trailer is written
try (InputStream in = Files.newInputStream(inputPath);
     OutputStream fout = Files.newOutputStream(outputPath);
     GzipCompressorOutputStream out = new GzipCompressorOutputStream(
             new BufferedOutputStream(fout))) {
    // Copy one buffer-sized chunk at a time
    final byte[] buffer = new byte[bufferSize];
    int n = 0;
    while (-1 != (n = in.read(buffer))) {
        out.write(buffer, 0, n);
    }
}
You could also wrap a BufferedReader and a PrintWriter around the streams if you feel more comfortable with them. They manage the buffering themselves, and you can read and write lines instead of bytes. Note that this only works correctly if the file is line-based text and not some other format.
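Since the answer says the logic is the same for all libraries, here is a sketch of the same streaming idea that swaps in the JDK's own java.util.zip classes for Apache Commons Compress. The class name GzipStreaming and the 8192-byte buffer are assumptions made for the example. Memory use stays at one buffer regardless of file size.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

final class GzipStreaming {
    private static final int BUFFER_SIZE = 8192; // assumed chunk size

    // Compresses source into target without holding the file in memory.
    static void compress(Path source, Path target) throws IOException {
        try (InputStream in = Files.newInputStream(source);
             OutputStream out = new GZIPOutputStream(Files.newOutputStream(target))) {
            copy(in, out);
        } // closing the GZIPOutputStream writes the gzip trailer
    }

    // Decompresses source into target, again chunk by chunk.
    static void decompress(Path source, Path target) throws IOException {
        try (InputStream in = new GZIPInputStream(Files.newInputStream(source));
             OutputStream out = Files.newOutputStream(target)) {
            copy(in, out);
        }
    }

    private static void copy(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[BUFFER_SIZE];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
    }
}
```

A call such as GzipStreaming.compress(Paths.get("archive.tar"), Paths.get("archive.tar.gz")) would mirror the packing example above.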

How to detect if BufferedInputStream was over while filling an array in Java?

FileInputStream fin = new FileInputStream(path);
BufferedInputStream bin = new BufferedInputStream(fin);
byte[] inputByte1= new byte[500];
byte[] inputByte2= new byte[500];
byte[] inputByte3 =new byte[34];
bin.read(inputByte1);
bin.read(inputByte2);
bin.read(inputByte3);
Let's say the file has only 400 bytes. How can I detect that?
I know that I could check if (bin.read(inputByte1) != 500),
but this looks really ugly to write on each line.
My main question is:
How can I detect that the stream ended before an array was filled?
I do not want to call bin.read() for each byte and check for -1.

First, on a Windows based system you need to escape the \ when you use it as a path separator. Next, you could use a FileInputStream (which you could wrap with a BufferedInputStream). Finally, you should close the InputStream when you're done (or you risk leaking file handles, sockets or some other resource). You might use a try-with-resources statement. Putting it all together, it might look something like
File f = new File("c:\\test\\test.txt");
try (InputStream is = new BufferedInputStream(new FileInputStream(f))) {
    int val;
    while ((val = is.read()) != -1) {
        System.out.println((byte) val);
    }
} catch (IOException ioe) {
    ioe.printStackTrace();
}

I would recommend a DataInputStream over a BufferedInputStream over a FileInputStream. That way you can use the readFully() method to read exactly as many bytes as you need each time without having to loop.
c:\test\test.txt
Use forward slashes in Java:
c:/test/test.txt
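The readFully() suggestion above can be sketched as follows. The helper tryFill and the class name are hypothetical, and a 400-byte in-memory stream stands in for the file from the question. DataInputStream.readFully() throws EOFException when the stream ends before the array is full, so one small helper replaces the length check on every line.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;

final class ReadFullyDemo {
    // Fills buf completely, or returns false if the stream ends first.
    static boolean tryFill(DataInputStream in, byte[] buf) throws IOException {
        try {
            in.readFully(buf);
            return true;
        } catch (EOFException e) {
            return false; // fewer than buf.length bytes were left
        }
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the 400-byte file from the question.
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(new byte[400]))) {
            byte[] inputByte1 = new byte[500];
            System.out.println("filled: " + tryFill(in, inputByte1)); // filled: false
        }
    }
}
```

For a real file, the stream would be built as new DataInputStream(new BufferedInputStream(new FileInputStream(path))), as the answer describes.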

How to unzip file from InputStream

I'm trying to get a zip file from the server.
I'm using HttpURLConnection to get an InputStream, and this is what I have:
myInputStream.toString().getBytes().toString() is equal to [B@4.....
byte[] bytes = Base64.decode(myInputStream.toString(), Base64.DEFAULT);
String string = new String(bytes, "UTF-8");
string == �&ܢ��z�m����y....
I really tried to unzip this file but nothing works. There are also so many questions: should I use GZIPInputStream or ZipInputStream? Do I have to save this stream as a file, or can I work on the InputStream directly?
Please help, my boss is getting impatient :O
I have no idea what is in this file; I have to find out :)

GZIPInputStream and ZipInputStream handle two different formats. https://en.wikipedia.org/wiki/Gzip
It is not a good idea to retrieve a string directly from the stream. From an InputStream, you can create a File and write the data into it using a FileOutputStream.
Decoding Base64 is a separate step. If the data has already been decoded upstream, that's fine; otherwise you have to wrap your stream in another input stream that decodes the Base64 format.
The best practice is to copy through a fixed-size buffer so that the whole file is never held in memory.
Here is some Kotlin code that decompresses the zipped InputStream into a file (simpler than Java, where managing byte[] is tedious):
val fileBinaryDecompress = File(..path..)
val outputStream = FileOutputStream(fileBinaryDecompress)
val zipStream = ZipInputStream(myInputStream)
// Position the stream on the first entry; without this, read() returns -1 immediately
zipStream.nextEntry
readFromStream(zipStream, BUFFER_SIZE_BYTES,
    object : ReadBytes {
        override fun read(buffer: ByteArray) {
            outputStream.write(buffer)
        }
    })
outputStream.close()

interface ReadBytes {
    /**
     * Called after each buffer fill
     * @param buffer filled
     */
    @Throws(IOException::class)
    fun read(buffer: ByteArray)
}

@Throws(IOException::class)
fun readFromStream(inputStream: InputStream, bufferSize: Int, readBytes: ReadBytes) {
    val buffer = ByteArray(bufferSize)
    var read = 0
    while (read != -1) {
        read = inputStream.read(buffer, 0, buffer.size)
        if (read != -1) {
            // Hand over only the bytes that were actually read
            val optimizedBuffer: ByteArray = if (buffer.size == read) {
                buffer
            } else {
                buffer.copyOf(read)
            }
            readBytes.read(optimizedBuffer)
        }
    }
}
If you want to get the file from the server without decompressing it, remove the ZipInputStream() decorator.
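The idea of wrapping the stream in a Base64-decoding stream before unzipping can be sketched in Java as follows. It uses java.util.Base64 rather than the Android Base64 class seen in the question, assumes the payload is Base64 text that encodes a zip archive, and the class and method names are made up for the example. The MIME decoder is chosen because it tolerates line breaks in the encoded text.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Base64;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

final class Base64ZipReader {
    // Assumes base64Stream carries Base64 text that encodes a zip archive.
    // Returns the bytes of the first entry, or null if the archive has none.
    static byte[] firstEntry(InputStream base64Stream) throws IOException {
        InputStream decoded = Base64.getMimeDecoder().wrap(base64Stream);
        try (ZipInputStream zip = new ZipInputStream(decoded)) {
            ZipEntry entry = zip.getNextEntry();
            if (entry == null) {
                return null;
            }
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[4096];
            int n;
            while ((n = zip.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
            return out.toByteArray();
        }
    }
}
```

If the server sends the archive as raw bytes instead, the Base64 wrapper would simply be left out.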

GZIPInputStream and ZipInputStream read different formats (gzip is a single compressed stream, zip is an archive of entries), so only the one that matches the data will work.
Next, you need to identify whether the zipped stream was Base64 encoded, or whether some Base64-encoded content was put into a zipped stream. From what you put in your question, it seems to be the latter.
So you should try
ZipInputStream zis = new ZipInputStream( myInputStream );
ZipEntry ze = zis.getNextEntry();
// zis is now positioned on entry ze; read the entry's data from zis itself
and proceed from there ...
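When it is unknown whether the server sends gzip or zip data, one way to decide between GZIPInputStream and ZipInputStream is to peek at the first two bytes: gzip data starts with 0x1f 0x8b, and zip signatures start with the letters "PK". The following is a minimal sketch with hypothetical names; it uses mark/reset on a BufferedInputStream so that the peeked bytes are not consumed.

```java
import java.io.BufferedInputStream;
import java.io.IOException;

final class FormatSniffer {
    enum Format { GZIP, ZIP, UNKNOWN }

    // Peeks at the first two bytes and rewinds, so the caller can still
    // hand the same stream to GZIPInputStream or ZipInputStream afterwards.
    static Format sniff(BufferedInputStream in) throws IOException {
        in.mark(2);
        int b0 = in.read();
        int b1 = in.read();
        in.reset();
        if (b0 == 0x1f && b1 == 0x8b) {
            return Format.GZIP; // gzip magic number
        }
        if (b0 == 'P' && b1 == 'K') {
            return Format.ZIP; // zip signatures begin with "PK"
        }
        return Format.UNKNOWN; // possibly Base64 text or something else
    }
}
```

A result of UNKNOWN on data that looks like letters and digits would hint that a Base64 decoding step is needed first.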

Basically, wrapping the input stream in a GZIPInputStream should let you read the actual content.
Also, for simplicity, IOUtils from Apache Commons IO makes your life easy.
This works for me:
InputStream is; // initialize your InputStream
is = new GZIPInputStream(is);
byte[] bytes = IOUtils.toByteArray(is);
String s = new String(bytes);
System.out.println(s);

FileOutputStream:Something That I am Missing Out?

I have this program that reads 2 KB of data from a binary file, adds a header to it, and then writes it to a new file.
The code is
try {
    FileInputStream fis = new FileInputStream(bin);
    FileOutputStream fos = new FileOutputStream(bin.getName().replace(".bin", ".xyz"));
    DataOutputStream dos=new DataOutputStream(fos);
    fos.write(big, 0, big.length);
    for (int n = 1; n <= pcount; n++) {
        fis.read(file, mark, 2048);
        mark = mark + 2048;
        prbar.setValue(n);
        prbar.setString("Converted packets:" + String.valueOf(n));
        metas = "2048";
        meta = metas.getBytes();
        pc = String.valueOf(file.length).getBytes();
        nval = String.valueOf(n).getBytes();
        System.arraycopy(pc, 0, bmeta, 0, pc.length);
        System.arraycopy(meta, 0, bmeta, 4, meta.length);
        System.arraycopy(nval, 0, bmeta, 8, nval.length);
        fos.write(bmeta, 0, bmeta.length);
        fos.flush();
        fos.write(file, 0, 2048);
        fos.flush();
    }
} catch (Exception ex) {
    erlabel.setText(ex.getMessage());
}
First it should write the header and then the file. But the output file is full of data that does not belong to the file; it is writing some garbage data. What may be the problem?

It's not quite clear with some of the declarations missing, but it looks like your problem is with the fis.read() method: the second argument is an offset in the byte array, not the file (common mistake).
You probably want to use relative reads. You also need to check the return value from .read() to see how many bytes were actually read, before writing the buffer out.
The common idiom is:
InputStream is = ...
OutputStream os = ...
byte[] buf = new byte[2048];
int len;
while((len = is.read(buf)) != -1)
os.write(buf, 0, len);
is.close();
os.close();
Edit
That's a pretty weird way of writing out your metadata, I assume that's what the (unused) DataOutputStream is for?
You don't need to keep flushing the output stream, just close it when you're done.

In addition to what @Dmitri has pointed out, there is something seriously wrong with the way you are writing the metadata.
You are writing the metadata every time around the loop, which cannot be right.
You are essentially allocating 4 bytes for it, via "2048".getBytes(), then copying many more than 4 bytes into it, then writing the 4 bytes. This cannot be right either; in fact it should really be throwing ArrayIndexOutOfBoundsException at you.
It looks as though the metadata is supposed to contain three binary integers. However you are putting String data into it. I suspect you should be using DataOutputStream.writeInt() directly for these fields, without all the String.valueOf()/getBytes() and System.arraycopy() nonsense.
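The writeInt() suggestion, combined with checking how many bytes were actually read, might look like the sketch below. The header layout (total length, payload length, packet number) is only a guess at what the question intends, the class and method names are hypothetical, and InputStream.readNBytes requires Java 9 or later.

```java
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

final class PacketWriter {
    static final int PACKET_SIZE = 2048;

    // Splits the input into packets of up to PACKET_SIZE bytes. Each packet is
    // preceded by three binary ints: total input length, payload length of
    // this packet, and packet number (assumed layout). Returns the packet count.
    static int writePackets(InputStream in, OutputStream rawOut, int totalLength)
            throws IOException {
        DataOutputStream out = new DataOutputStream(rawOut);
        byte[] packet = new byte[PACKET_SIZE];
        int packetNumber = 0;
        int len;
        // readNBytes keeps reading until the buffer is full or the stream ends,
        // and returns 0 once nothing is left.
        while ((len = in.readNBytes(packet, 0, PACKET_SIZE)) > 0) {
            packetNumber++;
            out.writeInt(totalLength);
            out.writeInt(len);
            out.writeInt(packetNumber);
            out.write(packet, 0, len); // only the bytes actually read
        }
        out.flush();
        return packetNumber;
    }
}
```

Writing len bytes rather than a fixed 2048 is what keeps stale buffer contents out of the last packet.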

I would suggest using a community-supported library such as Apache Commons IO for IO features.
It has useful classes and methods:
org.apache.commons.io.DirectoryWalker;
org.apache.commons.io.FileUtils;
org.apache.commons.io.IOCase;
FileUtils.copyDirectory(from, to);
FileUtils.writeByteArrayToFile(file, data);
FileUtils.writeStringToFile(file, data);
FileUtils.deleteDirectory(dir);
FileUtils.forceDelete(dir);

Reading from a ZipInputStream into a ByteArrayOutputStream

I am trying to read a single file from a java.util.zip.ZipInputStream, and copy it into a java.io.ByteArrayOutputStream (so that I can then create a java.io.ByteArrayInputStream and hand that to a 3rd party library that will end up closing the stream, and I don't want my ZipInputStream getting closed).
I'm probably missing something basic here, but I never enter the while loop here:
ByteArrayOutputStream streamBuilder = new ByteArrayOutputStream();
int bytesRead;
byte[] tempBuffer = new byte[8192*2];
try {
    while ((bytesRead = zipStream.read(tempBuffer)) != -1) {
        streamBuilder.write(tempBuffer, 0, bytesRead);
    }
} catch (IOException e) {
    // ...
}
What am I missing that will allow me to copy the stream?
Edit:
I should have mentioned earlier that this ZipInputStream is not coming from a file, so I don't think I can use a ZipFile. It is coming from a file uploaded through a servlet.
Also, I have already called getNextEntry() on the ZipInputStream before getting to this snippet of code. If I don't try copying the file into another InputStream (via the OutputStream mentioned above), and just pass the ZipInputStream to my 3rd party library, the library closes the stream, and I can't do anything more, like dealing with the remaining files in the stream.

Your loop looks valid - what does the following code (just on its own) return?
zipStream.read(tempBuffer)
If it's returning -1, then the zipStream is closed before you get it, and all bets are off. It's time to use your debugger and make sure what's being passed to you is actually valid.
When you call getNextEntry(), does it return a value, and is the data in the entry meaningful (i.e. does getCompressedSize() return a valid value)? If you are just reading a Zip file that doesn't have read-ahead zip entries embedded, then ZipInputStream isn't going to work for you.
Some useful tidbits about the Zip format:
Each file embedded in a zip file has a header. This header can contain useful information (such as the compressed length of the stream, its offset in the file, CRC) - or it can contain some magic values that basically say 'The information isn't in the stream header, you have to check the Zip post-amble'.
Each zip file then has a table that is attached to the end of the file that contains all of the zip entries, along with the real data. The table at the end is mandatory, and the values in it must be correct. In contrast, the values embedded in the stream do not have to be provided.
If you use ZipFile, it reads the table at the end of the zip. If you use ZipInputStream, I suspect that getNextEntry() attempts to use the entries embedded in the stream. If those values aren't specified, then ZipInputStream has no idea how long the stream might be. The inflate algorithm is self-terminating (you actually don't need to know the uncompressed length of the output stream in order to fully recover the output), but it's possible that the Java version of this reader doesn't handle this situation very well.
I will say that it's fairly unusual to have a servlet returning a ZipInputStream (it's much more common to receive an InflaterInputStream if you are going to be receiving compressed content).

You probably tried reading from a FileInputStream like this:
ZipInputStream in = new ZipInputStream(new FileInputStream(...));
This won’t work since a zip archive can contain multiple files and you need to specify which file to read.
You could use java.util.zip.ZipFile and a library such as IOUtils from Apache Commons IO or ByteStreams from Guava that assist you in copying the stream.
Example:
ByteArrayOutputStream out = new ByteArrayOutputStream();
try (ZipFile zipFile = new ZipFile("foo.zip")) {
    ZipEntry zipEntry = zipFile.getEntry("fileInTheZip.txt");
    try (InputStream in = zipFile.getInputStream(zipEntry)) {
        IOUtils.copy(in, out);
    }
}

I'd use IOUtils from the commons io project.
IOUtils.copy(zipStream, byteArrayOutputStream);

You're missing the call
ZipEntry entry = zipStream.getNextEntry();
which positions the stream at the first decompressed byte of the first entry.
ByteArrayOutputStream streamBuilder = new ByteArrayOutputStream();
int bytesRead;
byte[] tempBuffer = new byte[8192*2];
ZipEntry entry = zipStream.getNextEntry();
try {
    while ( (bytesRead = zipStream.read(tempBuffer)) != -1 ){
        streamBuilder.write(tempBuffer, 0, bytesRead);
    }
} catch (IOException e) {
    ...
}

You could implement your own wrapper around the ZipInputStream that ignores close() and hand that off to the third-party library.
thirdPartyLib.handleZipData(new CloseIgnoringInputStream(zipStream));
class CloseIgnoringInputStream extends InputStream
{
    private ZipInputStream stream;

    public CloseIgnoringInputStream(ZipInputStream inStream)
    {
        stream = inStream;
    }

    public int read() throws IOException {
        return stream.read();
    }

    public void close()
    {
        //ignore
    }

    public void reallyClose() throws IOException
    {
        stream.close();
    }
}
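A variant of the wrapper above can be built on java.io.FilterInputStream, which already forwards both single-byte and bulk reads to the wrapped stream, so only close() needs overriding. This is a sketch with a hypothetical class name, not the answerer's code. Because bulk reads are delegated, the third-party library does not fall back to reading one byte at a time.

```java
import java.io.FilterInputStream;
import java.io.InputStream;

final class NonClosingInputStream extends FilterInputStream {
    NonClosingInputStream(InputStream in) {
        super(in);
    }

    @Override
    public void close() {
        // Intentionally empty: whoever owns the wrapped stream closes it.
    }
}
```

Usage would follow the call shown above, for example thirdPartyLib.handleZipData(new NonClosingInputStream(zipStream)), after which zipStream.getNextEntry() can continue with the remaining entries.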

I would call getNextEntry() on the ZipInputStream until it is at the entry you want (use ZipEntry.getName() etc.). Calling getNextEntry() will advance the "cursor" to the beginning of the entry that it returns. Then, use ZipEntry.getSize() to determine how many bytes you should read using zipInputStream.read().
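The approach of advancing with getNextEntry() until the wanted name appears can be sketched as follows; the class and method names are hypothetical. One caveat on getSize(): for entries whose sizes are not recorded in the entry header it can return -1, so this sketch reads until read() reports the end of the entry instead of counting bytes.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

final class ZipEntryReader {
    // Advances zip to the entry called name and returns its bytes, or null
    // if no such entry follows the current position. The stream stays open,
    // so the caller can go on to later entries.
    static byte[] readEntry(ZipInputStream zip, String name) throws IOException {
        ZipEntry entry;
        while ((entry = zip.getNextEntry()) != null) {
            if (!entry.getName().equals(name)) {
                continue; // getNextEntry() skips the rest of this entry
            }
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[8192];
            int n;
            while ((n = zip.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
            return out.toByteArray();
        }
        return null;
    }
}
```

The returned array can be wrapped in a ByteArrayInputStream and handed to the third-party library, which may then close it without affecting the ZipInputStream.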

It is unclear how you got the zipStream. It should work when you get it like this:
zipStream = zipFile.getInputStream(zipEntry)
If you are obtaining the input stream from a ZipFile, you can get one stream for the third-party library, let it use it, and then obtain another input stream with the code above.
Remember, an input stream is a cursor. If you have the entire data (as with a ZipFile), you can ask for any number of cursors over it.
It is a different case if you only have a zipped byte stream, such as a "GZip" input stream. In that case your ByteArrayOutputStream buffer makes perfect sense.

Please try the code below:
private static byte[] getZipArchiveContent(File zipName) throws WorkflowServiceBusinessException {
    BufferedInputStream buffer = null;
    FileInputStream fileStream = null;
    ByteArrayOutputStream byteOut = null;
    byte data[] = new byte[BUFFER];
    try {
        try {
            fileStream = new FileInputStream(zipName);
            buffer = new BufferedInputStream(fileStream);
            byteOut = new ByteArrayOutputStream();
            int count;
            while((count = buffer.read(data, 0, BUFFER)) != -1) {
                byteOut.write(data, 0, count);
            }
        } catch(Exception e) {
            throw new WorkflowServiceBusinessException(e.getMessage(), e);
        } finally {
            if(null != fileStream) {
                fileStream.close();
            }
            if(null != buffer) {
                buffer.close();
            }
            if(null != byteOut) {
                byteOut.close();
            }
        }
    } catch(Exception e) {
        throw new WorkflowServiceBusinessException(e.getMessage(), e);
    }
    return byteOut.toByteArray();
}

Check whether the input stream is positioned at the beginning.
Otherwise, as for the implementation: I do not think you need to write to the result stream while you are reading, unless you process this exact stream in another thread.
Just create a byte array, read the input stream into it, then create the output stream.
