How do I parse some files in a tar.bz2 archive with Java - java

I have written a parser for an individual file, but can I read each file within the archive without actually extracting the archive to disk?

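Following the examples at http://commons.apache.org/proper/commons-compress/examples.html, you have to wrap one InputStream in another:
import java.io.FileInputStream;
import org.apache.commons.compress.archivers.ArchiveEntry;
import org.apache.commons.compress.archivers.tar.TarArchiveInputStream;
import org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream;
// 1st InputStream from your compressed file
FileInputStream in = new FileInputStream(tarbz2File);
// wrap in a 2nd InputStream that deals with compression
BZip2CompressorInputStream bzIn = new BZip2CompressorInputStream(in);
// wrap in a 3rd InputStream that deals with tar
TarArchiveInputStream tarIn = new TarArchiveInputStream(bzIn);
ArchiveEntry entry = null;
while (null != (entry = tarIn.getNextEntry())) {
    // skip directories and empty files
    if (entry.getSize() < 1) {
        continue;
    }
    // use your parser here; the tar InputStream ends at the end of the current entry
    // (the parser must not close tarIn, or the remaining entries become unreadable)
    parser.parse(tarIn);
}
tarIn.close();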

Related

Spark Reading .7z files

I am trying to read .7z files in Spark using Scala or Java, but I can't find any appropriate methods or functionality.
For zip files I am able to read them because the ZipInputStream class takes an InputStream, but for 7z files the SevenZFile class doesn't take an input stream.
https://commons.apache.org/proper/commons-compress/javadocs/api-1.16/org/apache/commons/compress/archivers/sevenz/SevenZFile.html
Zip file code
spark.sparkContext.binaryFiles("fileName").flatMap { case (name: String, content: PortableDataStream) =>
  val zis = new ZipInputStream(content.open)
  Stream.continually(zis.getNextEntry)
    .takeWhile(_ != null)
    .flatMap { _ =>
      val br = new BufferedReader(new InputStreamReader(zis))
      Stream.continually(br.readLine()).takeWhile(_ != null)
    }
}
I am trying similar code for the 7z files, something like:
spark.sparkContext.binaryFiles("filename").flatMap { case (name: String, content: PortableDataStream) =>
  val zis = new SevenZFile(content.open)
  Stream.continually(zis.getNextEntry)
    .takeWhile(_ != null)
    .flatMap { _ =>
      val br = new BufferedReader(new InputStreamReader(zis))
      Stream.continually(br.readLine()).takeWhile(_ != null)
    }
}
But SevenZFile doesn't accept an input stream. Looking for ideas.
If the file is on the local filesystem the following solution works, but my file is in HDFS.
Local filesystem code:
public static void decompress(String in, File destination) throws IOException {
    SevenZFile sevenZFile = new SevenZFile(new File(in));
    SevenZArchiveEntry entry;
    while ((entry = sevenZFile.getNextEntry()) != null) {
        if (entry.isDirectory()) {
            continue;
        }
        File curfile = new File(destination, entry.getName());
        File parent = curfile.getParentFile();
        if (!parent.exists()) {
            parent.mkdirs();
        }
        FileOutputStream out = new FileOutputStream(curfile);
        byte[] content = new byte[(int) entry.getSize()];
        sevenZFile.read(content, 0, content.length);
        out.write(content);
        out.close();
    }
}
After all these years of Spark evolution there should be an easy way to do it.
Instead of using the java.io.File-based approach, you could try the SeekableByteChannel method as shown in this alternative constructor.
You can use a SeekableInMemoryByteChannel to read a byte array. So as long as you can pick up the 7zip files from S3 or whatever and hand them off as byte arrays you should be alright.
With all of that said, Spark is really not well-suited for processing things like zip and 7zip files. I can tell you from personal experience I've seen it fail badly once the files are too large for Spark's executors to handle.
Something like Apache NiFi will work much better for expanding archives and processing them. FWIW, I'm currently handling a large data dump that has me frequently dealing with 50GB tarballs that have several million files in them, and NiFi handles them very gracefully.

Called a SOAP web service which returns a zip file as an attachment. How to unzip it in memory?

I have seen posts about how to unzip files using Java, where the zip file is located somewhere on disk. In my case it's different.
I have code which calls a SOAP web service. The service response includes an attachment which is a zip file. I have been able to get the attachment. Here is part of the code:
Iterator<?> i = soapResponse.getAttachments();
Object obj = null;
AttachmentPart att = (AttachmentPart) i.next();
So I have the zip file as an AttachmentPart; however, I could also do:
byte[] arr1 = att.getRawContentBytes();
which would give me the array of bytes containing the zip file.
I could also do:
Object obj = att.getContent();
So I can get the zip file in different formats/types. The zip file contains two .csv files and I have to do different things with each of them. To make my question simpler, all I am looking to do for now is to get the two .csv files and print their contents to the console.
I want to do everything in memory. I don't want to put the content of the zip files on disk.
How can I unzip the attachment and print the content?
If you grab the att.getRawContent() from the AttachmentPart object, you can pass it to the built in ZipInputStream to read the contents of the zip file. You can then write the bytes read from the ZipInputStream directly to System.out to view the contents on the console.
Below is an example that should read the zip contents and then write the entry name followed by the entry contents to standard out, assuming you pass it the AttachmentPart that contains the zip file. It will also filter out any entries that are directories so that they are not printed.
public static void printAttachmentPartZip(AttachmentPart att) throws IOException, SOAPException {
    try (ZipInputStream zis = new ZipInputStream(att.getRawContent())) {
        byte[] buffer = new byte[1024];
        for (ZipEntry zipEntry = zis.getNextEntry(); zipEntry != null; zipEntry = zis.getNextEntry()) {
            if (zipEntry.isDirectory()) {
                continue;
            }
            System.out.println(zipEntry.getName());
            for (int len = zis.read(buffer); len > 0; len = zis.read(buffer)) {
                System.out.write(buffer, 0, len);
            }
        }
    }
}
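If the attachment is already in hand as a byte array (for example from att.getRawContentBytes()), the same loop can run over a ByteArrayInputStream and collect the entries instead of printing them. The following is only a sketch, not part of the original answer: the class name InMemoryUnzip and the method readEntries are made up for illustration, and it assumes the .csv files are UTF-8 text and small enough to hold in memory.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper: unpacks a zip held in a byte array without touching disk.
class InMemoryUnzip {

    // Returns entry name -> decoded text, in archive order.
    // Assumption: every entry is UTF-8 text.
    static Map<String, String> readEntries(byte[] zipBytes) throws IOException {
        Map<String, String> contents = new LinkedHashMap<>();
        try (ZipInputStream zis = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            byte[] buffer = new byte[4096];
            ZipEntry entry;
            while ((entry = zis.getNextEntry()) != null) {
                if (entry.isDirectory()) {
                    continue; // directory entries carry no data
                }
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                int len;
                // read() returns -1 at the end of the current entry, not of the whole archive
                while ((len = zis.read(buffer)) != -1) {
                    out.write(buffer, 0, len);
                }
                contents.put(entry.getName(), new String(out.toByteArray(), StandardCharsets.UTF_8));
            }
        }
        return contents;
    }
}
```

With that in place, something like InMemoryUnzip.readEntries(att.getRawContentBytes()) would give you both .csv files keyed by name, so each one can be handled differently.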

Getting specific file from ZipInputStream

I can go through ZipInputStream, but before starting the iteration I want to get a specific file that I need during the iteration. How can I do that?
ZipInputStream zin = new ZipInputStream(myInputStream)
while ((entry = zin.getNextEntry()) != null)
{
println entry.getName()
}
If the myInputStream you're working with comes from a real file on disk then you can simply use java.util.zip.ZipFile instead, which is backed by a RandomAccessFile and provides direct access to the zip entries by name. But if all you have is an InputStream (e.g. if you're processing the stream directly on receipt from a network socket or similar) then you'll have to do your own buffering.
You could copy the stream to a temporary file, then open that file using ZipFile, or if you know the maximum size of the data in advance (e.g. for an HTTP request that declares its Content-Length up front) you could use a BufferedInputStream to buffer it in memory until you've found the required entry.
BufferedInputStream bufIn = new BufferedInputStream(myInputStream);
bufIn.mark(contentLength);
ZipInputStream zipIn = new ZipInputStream(bufIn);
ZipEntry entry;
boolean foundSpecial = false;
while ((entry = zipIn.getNextEntry()) != null) {
    if ("special.txt".equals(entry.getName())) {
        // do whatever you need with the special entry
        foundSpecial = true;
        break;
    }
}
if (foundSpecial) {
    // rewind
    bufIn.reset();
    zipIn = new ZipInputStream(bufIn);
    // ....
}
(I haven't tested this code myself, you may find it's necessary to use something like the commons-io CloseShieldInputStream in between the bufIn and the first zipIn, to allow the first zip stream to close without closing the underlying bufIn before you've rewound it).
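For what it's worth, here is a self-contained sketch of the mark/reset idea. It is an illustration under assumptions rather than a drop-in solution: the class ZipTwoPass and its method names are hypothetical, the special entry is assumed to be UTF-8 text, and readLimit is assumed to be larger than the total size of the zip data (otherwise reset() throws an IOException). It sidesteps the close problem by simply not closing the first ZipInputStream.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper showing two passes over one non-seekable stream.
class ZipTwoPass {

    // Pass 1: find `wanted` and return its text (null if absent).
    // Pass 2: rewind and append every entry name to `namesOut`.
    // Assumption: readLimit is larger than the total size of the zip data.
    static String readSpecialThenList(InputStream raw, int readLimit, String wanted,
                                      List<String> namesOut) throws IOException {
        BufferedInputStream bufIn = new BufferedInputStream(raw);
        bufIn.mark(readLimit);

        String special = null;
        // Deliberately not closed: closing it would also close bufIn.
        ZipInputStream first = new ZipInputStream(bufIn);
        ZipEntry entry;
        while ((entry = first.getNextEntry()) != null) {
            if (wanted.equals(entry.getName())) {
                special = new String(readCurrentEntry(first), StandardCharsets.UTF_8);
                break;
            }
        }

        bufIn.reset(); // rewind to the start of the zip data
        try (ZipInputStream second = new ZipInputStream(bufIn)) {
            while ((entry = second.getNextEntry()) != null) {
                namesOut.add(entry.getName());
            }
        }
        return special;
    }

    // Reads until the zip stream signals the end of the current entry.
    private static byte[] readCurrentEntry(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        int len;
        while ((len = in.read(buffer)) != -1) {
            out.write(buffer, 0, len);
        }
        return out.toByteArray();
    }
}
```

The trade-off is memory: everything read before the reset stays buffered, so this only makes sense when the size is bounded and known up front.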
Use the getName() method on ZipEntry to get the file you want.
ZipInputStream zin = new ZipInputStream(myInputStream);
String myFileName = "foo.txt";
ZipEntry entry;
while ((entry = zin.getNextEntry()) != null) {
    if (entry.getName().equals(myFileName)) {
        // process your file
        // stop looking for your file - you've already found it
        break;
    }
}
From Java 7 onwards, you are better off using ZipFile instead of ZipInputStream if you only want one file and you have a file to read from:
ZipFile zfile = new ZipFile(aFile);
String myFile = "foo.txt";
ZipEntry entry = zfile.getEntry(myFile);
if (entry != null) {
    // process your file
}
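To make the ZipFile route concrete, a minimal sketch follows. The class ZipFileLookup and the method readEntry are names invented for this example; it assumes the archive is a regular file on disk and that the entry fits in memory.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Path;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper: fetches one entry by name via ZipFile's central-directory lookup.
class ZipFileLookup {

    // Returns the bytes of the named entry, or null when the archive has no such entry.
    static byte[] readEntry(Path zipPath, String entryName) throws IOException {
        try (ZipFile zip = new ZipFile(zipPath.toFile())) {
            ZipEntry entry = zip.getEntry(entryName);
            if (entry == null) {
                return null;
            }
            try (InputStream in = zip.getInputStream(entry)) {
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                byte[] buffer = new byte[4096];
                int len;
                while ((len = in.read(buffer)) != -1) {
                    out.write(buffer, 0, len);
                }
                return out.toByteArray();
            }
        }
    }
}
```

Unlike the ZipInputStream loop, this does not have to walk past the entries that come before the one you want.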
Look at "Finding a file in zip entry":
ZipFile file = new ZipFile("file.zip");
InputStream is = searchImage("foo.png", file);
public static InputStream searchImage(String name, ZipFile file) throws IOException {
    // entries() returns an Enumeration, so wrap it to use the for-each loop
    for (ZipEntry e : Collections.list(file.entries())) {
        if (e.getName().endsWith(name)) {
            return file.getInputStream(e);
        }
    }
    return null;
}
I'm late to the party, but none of the above "answers" actually answers the question, and the accepted "answer" suggests creating a temp file, which is inefficient.
Let's create a sample zip file:
seq 10000 | sed "s/^.*$/a/"> /tmp/a
seq 10000 20000 | sed "s/^.*$/b/"> /tmp/b
seq 20000 30000 | sed "s/^.*$/c/"> /tmp/c
zip /tmp/out.zip /tmp/a /tmp/b /tmp/c
So now we have the file /tmp/out.zip, which contains 3 files, each of them full of the character a, b or c.
Now let's read it:
public static void main(String[] args) throws IOException {
    try (ZipInputStream zipStream = new ZipInputStream(new FileInputStream("/tmp/out.zip"))) {
        ZipEntry zipEntry;
        while ((zipEntry = zipStream.getNextEntry()) != null) {
            String name = zipEntry.getName();
            System.out.println("Entry: " + name);
            if (name.equals("tmp/c")) {
                byte[] bytes = zipStream.readAllBytes();
                String s = new String(bytes);
                System.out.println(s);
            }
        }
    }
}
Calling readAllBytes (available from Java 9) in the middle of processing the stream looks odd, but it works: ZipInputStream reports end-of-stream at the end of the current entry, so readAllBytes returns only that entry's data. I also tested it on some images, where there is a higher chance of failure. So it's just an unintuitive API.

What is the idiomatic way to copy a ZipEntry into a new ZipFile?

I'm writing a tool to do some minor text replacement in a DOCX file, which is a zipped format. My method is to copy ZipEntry contents from entries in the original file into the modified file using a ZipOutputStream. For most DOCX files this works well, but occasionally I will encounter ZipExceptions regarding discrepancies between the contents I've written and the meta-information contained in the ZipEntry (usually a difference in compressed size).
Here's the code I'm using to copy over contents. I've stripped out error handling and document processing for brevity; I haven't had issues with the document entry so far.
ZipFile original = new ZipFile(INPUT_FILENAME);
ZipOutputStream outputStream = new ZipOutputStream(new FileOutputStream(OUTPUT_FILE));
Enumeration entries = original.entries();
byte[] buffer = new byte[512];
while (entries.hasMoreElements()) {
    ZipEntry entry = (ZipEntry) entries.nextElement();
    if ("word/document.xml".equalsIgnoreCase(entry.getName())) {
        // perform special processing
    } else {
        outputStream.putNextEntry(entry);
        InputStream in = original.getInputStream(entry);
        while (0 < in.available()) {
            int read = in.read(buffer);
            outputStream.write(buffer, 0, read);
        }
        in.close();
    }
    outputStream.closeEntry();
}
outputStream.close();
What is the proper or idiomatic way to directly copy ZipEntry objects from one ZipFile to another?
I have found a workaround that avoids the error. By creating a new ZipEntry with only the name field set I'm able to copy over contents without issue.
ZipFile original = new ZipFile(INPUT_FILENAME);
ZipOutputStream outputStream = new ZipOutputStream(new FileOutputStream(OUTPUT_FILE));
Enumeration entries = original.entries();
byte[] buffer = new byte[512];
while (entries.hasMoreElements()) {
    ZipEntry entry = (ZipEntry) entries.nextElement();
    if ("word/document.xml".equalsIgnoreCase(entry.getName())) {
        // perform special processing
    } else {
        // create a new empty ZipEntry
        ZipEntry newEntry = new ZipEntry(entry.getName());
        // outputStream.putNextEntry(entry);
        outputStream.putNextEntry(newEntry);
        InputStream in = original.getInputStream(entry);
        while (0 < in.available()) {
            int read = in.read(buffer);
            if (read > 0) {
                outputStream.write(buffer, 0, read);
            }
        }
        in.close();
    }
    outputStream.closeEntry();
}
outputStream.close();
However, this method loses any meta-information stored in the fields of the original ZipEntry (e.g. comment, extra). The API docs aren't clear on whether this is important.
To keep your meta-data for the zip entry, create it using ZipEntry's "copy constructor":
ZipEntry newEntry = new ZipEntry(entry);
You can then modify just the name or the comments etc. and everything else will be copied from the given entry.
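One caveat, as far as I understand the java.util.zip behaviour: the copy constructor also copies the compressed size, and since ZipOutputStream recompresses the data, the new compressed size can differ from the recorded one, which is the kind of mismatch the question describes. A commonly used workaround is to reset it with setCompressedSize(-1) so the stream computes it again. The sketch below shows that under stated assumptions; the class ZipEntryCopy and the method copyAll are hypothetical names, and it copies every entry unchanged (the special handling of word/document.xml is left out).

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Enumeration;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipOutputStream;

// Hypothetical helper: copies all entries of one zip into another, keeping entry metadata.
class ZipEntryCopy {

    static void copyAll(Path source, Path target) throws IOException {
        try (ZipFile original = new ZipFile(source.toFile());
             ZipOutputStream out = new ZipOutputStream(Files.newOutputStream(target))) {
            byte[] buffer = new byte[8192];
            Enumeration<? extends ZipEntry> entries = original.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                // Copy constructor keeps name, time, comment, extra, method, size and CRC.
                ZipEntry copy = new ZipEntry(entry);
                // Assumption: forget the old compressed size so it is recomputed on write.
                copy.setCompressedSize(-1);
                out.putNextEntry(copy);
                try (InputStream in = original.getInputStream(entry)) {
                    int read;
                    // read() returns -1 at end of stream; available() is not a reliable end test
                    while ((read = in.read(buffer)) != -1) {
                        out.write(buffer, 0, read);
                    }
                }
                out.closeEntry();
            }
        }
    }
}
```

Note the loop condition as well: testing read() against -1 is the usual way to detect the end of the entry, whereas available() may return 0 before the data is exhausted.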
You could also look at Docmosis which can populate DocX files from Java.

java.util.zip.ZipException: too many entries in ZIP file

I am trying to write a Java class to extract a large zip file containing ~74000 XML files. I get the following exception when attempting to unzip it utilizing the java zip library:
java.util.zip.ZipException: too many entries in ZIP file
Unfortunately due to requirements of the project I can not get the zip broken down before it gets to me, and the unzipping process has to be automated (no manual steps). Is there any way to get around this limitation utilizing java.util.zip or with some 3rd party Java zip library?
Thanks.
Using ZipInputStream instead of ZipFile should probably do it.
Using IOUtils from Apache Commons IO:
FileInputStream fin = new FileInputStream(zip);
ZipInputStream zin = new ZipInputStream(fin);
ZipEntry ze = null;
while ((ze = zin.getNextEntry()) != null) {
    FileOutputStream fout = new FileOutputStream(new File(
            outputDirectory, ze.getName()));
    IOUtils.copy(zin, fout);
    IOUtils.closeQuietly(fout);
    zin.closeEntry();
}
IOUtils.closeQuietly(zin);
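If you would rather not pull in Commons IO, roughly the same streaming loop can be written with java.nio.file.Files. This is a sketch under assumptions, not a tested fix for the 74,000-entry archive: the class StreamingUnzip and the method extract are hypothetical names, and the directory handling plus the check that rejects entry names resolving outside the output directory are additions of mine that the loop above does not have.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper: extracts a zip by streaming over its entries one at a time.
class StreamingUnzip {

    // Extracts all entries below outputDirectory and returns the number of files written.
    static int extract(InputStream zipData, Path outputDirectory) throws IOException {
        Path root = outputDirectory.toAbsolutePath().normalize();
        int files = 0;
        try (ZipInputStream zin = new ZipInputStream(zipData)) {
            ZipEntry ze;
            while ((ze = zin.getNextEntry()) != null) {
                Path target = root.resolve(ze.getName()).normalize();
                // Added safety check: refuse names such as "../x" that leave the output directory.
                if (!target.startsWith(root)) {
                    throw new IOException("Entry escapes output directory: " + ze.getName());
                }
                if (ze.isDirectory()) {
                    Files.createDirectories(target);
                    continue;
                }
                Files.createDirectories(target.getParent());
                // Files.copy reads until the end of the current entry and leaves zin open.
                Files.copy(zin, target, StandardCopyOption.REPLACE_EXISTING);
                files++;
            }
        }
        return files;
    }
}
```

A call such as StreamingUnzip.extract(new FileInputStream(zip), outputDirectory.toPath()) would correspond to the loop above.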
The classic ZIP format supports a maximum of 65,535 entries per archive, because the entry count is stored in a 16-bit field.
Unless the Java library supports the ZIP64 extensions, it won't work properly if you are trying to read or write an archive with 74,000 entries.
I reworked the method to handle directory structures more conveniently and to zip a whole batch of targets at once.
Plain files are added to the root of the zip file; if you pass a directory, the underlying structure is preserved.
def zip(String zipFile, String[] filesToZip) {
    def result = new ZipOutputStream(new FileOutputStream(zipFile))
    result.withStream { zipOutStream ->
        filesToZip.each { fileToZip ->
            def ftz = new File(fileToZip)
            if (ftz.isDirectory()) {
                def pathlength = new File(ftz.absolutePath).parentFile.absolutePath.size()
                ftz.eachFileRecurse { f ->
                    // pass only the directory part; writeZipEntry appends the file name itself
                    if (!f.isDirectory()) writeZipEntry(f, zipOutStream, f.parentFile.absolutePath[pathlength..-1] + '/')
                }
            }
            else writeZipEntry(ftz, zipOutStream, '')
        }
    }
}
def writeZipEntry(File plainFile, ZipOutputStream zipOutStream, String path) {
    zipOutStream.putNextEntry(new ZipEntry(path + plainFile.name))
    new FileInputStream(plainFile).withStream { inStream ->
        def buffer = new byte[1024]
        def count
        while ((count = inStream.read(buffer, 0, 1024)) != -1) {
            // write only the bytes actually read, not the whole buffer
            zipOutStream.write(buffer, 0, count)
        }
    }
    zipOutStream.closeEntry()
}
