I am trying to write a Java class to extract a large zip file containing ~74000 XML files. I get the following exception when attempting to unzip it utilizing the java zip library:
java.util.zip.ZipException: too many entries in ZIP file
Unfortunately, due to the requirements of the project, I cannot get the zip broken down before it gets to me, and the unzipping process has to be automated (no manual steps). Is there any way to get around this limitation using java.util.zip or some third-party Java zip library?
Thanks.
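Using ZipInputStream instead of ZipFile should probably do it, since it reads the entries one by one instead of loading the central directory up front.
Using Apache Commons IO's IOUtils:
// needs java.io.*, java.util.zip.ZipEntry, java.util.zip.ZipInputStream, org.apache.commons.io.IOUtils
try (ZipInputStream zin = new ZipInputStream(new FileInputStream(zip))) {
    ZipEntry ze;
    while ((ze = zin.getNextEntry()) != null) {
        File target = new File(outputDirectory, ze.getName());
        if (ze.isDirectory()) {
            target.mkdirs();
            continue;
        }
        // entries may live in subdirectories that do not exist yet
        target.getParentFile().mkdirs();
        try (FileOutputStream fout = new FileOutputStream(target)) {
            IOUtils.copy(zin, fout);
        }
        zin.closeEntry();
    }
}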
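The original Zip format stores the entry count in a 16-bit field, so it supports a maximum of 65,535 entries per archive.
Unless the Java library supports the ZIP64 extensions (java.util.zip does from Java 7 onwards), it won't work properly if you are trying to read or write an archive with 74,000 entries.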
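To check whether a given JDK copes with more than 65,535 entries, something like the following sketch may help. It writes an archive with 70,000 empty entries and counts them again, once through ZipFile and once through ZipInputStream. It assumes Java 7 or later; the class name ManyEntriesCheck and its methods are made up for this illustration.

```java
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical helper class, not part of any library.
final class ManyEntriesCheck {

    // Writes `count` empty entries named entry-0.xml, entry-1.xml, ...
    static void writeEntries(Path zip, int count) throws IOException {
        try (OutputStream out = new BufferedOutputStream(Files.newOutputStream(zip));
             ZipOutputStream zos = new ZipOutputStream(out)) {
            for (int i = 0; i < count; i++) {
                zos.putNextEntry(new ZipEntry("entry-" + i + ".xml"));
                zos.closeEntry();
            }
        }
    }

    // Counts entries through the central directory (ZipFile).
    static int countWithZipFile(Path zip) throws IOException {
        try (ZipFile zf = new ZipFile(zip.toFile())) {
            return zf.size();
        }
    }

    // Counts entries by streaming through the archive (ZipInputStream).
    static int countWithStream(Path zip) throws IOException {
        int n = 0;
        try (InputStream in = Files.newInputStream(zip);
             ZipInputStream zis = new ZipInputStream(in)) {
            while (zis.getNextEntry() != null) {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) throws IOException {
        Path zip = Files.createTempFile("many-entries", ".zip");
        try {
            writeEntries(zip, 70_000);
            System.out.println("ZipFile sees " + countWithZipFile(zip) + " entries");
            System.out.println("ZipInputStream sees " + countWithStream(zip) + " entries");
        } finally {
            Files.deleteIfExists(zip);
        }
    }
}
```

If both counts come back as 70,000, the JDK in use handles ZIP64 archives and the original exception is more likely tied to an older runtime.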
I reworked the method to deal with directory structures more conveniently and to zip a whole bunch of targets at once.
Plain files will be added to the root of the zip file; if you pass a directory, the underlying structure will be preserved.
import java.util.zip.ZipEntry
import java.util.zip.ZipOutputStream

def zip(String zipFile, String[] filesToZip) {
    new ZipOutputStream(new FileOutputStream(zipFile)).withStream { zipOutStream ->
        filesToZip.each { fileToZip ->
            def ftz = new File(fileToZip).absoluteFile
            if (ftz.isDirectory()) {
                // entry names are relative to the directory's parent, so the directory itself is kept
                def base = ftz.parentFile.toPath()
                ftz.eachFileRecurse { f ->
                    if (!f.isDirectory()) {
                        def name = base.relativize(f.toPath()).toString().replace(File.separator, '/')
                        writeZipEntry(f, zipOutStream, name)
                    }
                }
            } else {
                // plain files go to the root of the zip
                writeZipEntry(ftz, zipOutStream, ftz.name)
            }
        }
    }
}

def writeZipEntry(File plainFile, ZipOutputStream zipOutStream, String entryName) {
    zipOutStream.putNextEntry(new ZipEntry(entryName))
    plainFile.withInputStream { inStream ->
        def buffer = new byte[1024]
        int count
        while ((count = inStream.read(buffer)) != -1) {
            // write only the bytes actually read, not the whole buffer
            zipOutStream.write(buffer, 0, count)
        }
    }
    zipOutStream.closeEntry()
}
I'm trying to compress multiple files into a single archive but with my current code, it only compresses it into a single blob inside the zip. Does anyone know how to segment the files with LZ4?
public void zipFile(File[] fileToZip, String outputFileName, boolean activeZip)
{
try (FileOutputStream fos = new FileOutputStream(new File(outputFileName), true);
LZ4FrameOutputStream lz4fos = new LZ4FrameOutputStream(fos);)
{
for (File a : fileToZip)
{
try (FileInputStream fis = new FileInputStream(a))
{
byte[] buf = new byte[bufferSizeZip];
int length;
while ((length = fis.read(buf)) > 0)
{
lz4fos.write(buf, 0, length);
}
}
}
}
catch (Exception e)
{
LOG.error("Zipping file failed ", e);
}
}
LZ4 and LZMA are both LZ77-family algorithms, although they make very different speed and ratio trade-offs. If LZMA is an option for you, you can create a zip archive with LZMA compression.
List<Path> files = Collections.emptyList();
Path zip = Paths.get("lzma.zip");
ZipEntrySettings entrySettings = ZipEntrySettings.builder()
.compression(Compression.LZMA, CompressionLevel.NORMAL)
.lzmaEosMarker(true).build();
ZipSettings settings = ZipSettings.builder().entrySettingsProvider(fileName -> entrySettings).build();
ZipIt.zip(zip)
.settings(settings)
.add(files);
See details in zip4jvm
LZ4 compresses a stream of bytes. You would need to archive your multiple files into a single archive such as a Tar Archive, then feed it into the LZ4 compressor.
I created a Java library that does this for you https://github.com/spoorn/tar-lz4-java.
If you want to implement it yourself, here's a technical doc that includes details on how to LZ4 compress a directory using TarArchive from Apache Commons and lz4-java: https://github.com/spoorn/tar-lz4-java/blob/main/SUMMARY.md#lz4
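The "archive first, then compress" idea can be sketched with the JDK alone. Note the substitutions: a made-up length-prefixed container stands in for tar, and GZIPOutputStream stands in for LZ4FrameOutputStream, purely so the example runs without extra dependencies. In a real project you would swap in TarArchiveOutputStream and the LZ4 stream. The class name ArchiveThenCompress and its methods are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical illustration: a toy container (stand-in for tar) wrapped in one compressed stream.
final class ArchiveThenCompress {

    // Container layout: entry count, then for each file its name, its length and its raw bytes.
    static void pack(Map<String, byte[]> files, OutputStream compressed) throws IOException {
        DataOutputStream out = new DataOutputStream(compressed);
        out.writeInt(files.size());
        for (Map.Entry<String, byte[]> e : files.entrySet()) {
            out.writeUTF(e.getKey());
            out.writeInt(e.getValue().length);
            out.write(e.getValue());
        }
        out.flush();
    }

    // Reads the container back, restoring the file boundaries that the compressor knows nothing about.
    static Map<String, byte[]> unpack(InputStream decompressed) throws IOException {
        DataInputStream in = new DataInputStream(decompressed);
        int count = in.readInt();
        Map<String, byte[]> files = new LinkedHashMap<>();
        for (int i = 0; i < count; i++) {
            String name = in.readUTF();
            byte[] data = new byte[in.readInt()];
            in.readFully(data);
            files.put(name, data);
        }
        return files;
    }

    // GZIP is a stand-in here; an LZ4 frame stream would be used in the same position.
    static byte[] compress(Map<String, byte[]> files) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bytes)) {
            pack(files, gz);
        }
        return bytes.toByteArray();
    }

    static Map<String, byte[]> decompress(byte[] blob) throws IOException {
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(blob))) {
            return unpack(gz);
        }
    }
}
```

The point of the design is that the compressor only ever sees one byte stream; the file names and boundaries live in the container layer, which is why the code in the question ends up with a single blob.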
I am trying to read .7z files in Spark using Scala or Java, but I can't find any appropriate methods or functionality.
For zip files I am able to read them, because the ZipInputStream class takes an InputStream, but for 7z files the SevenZFile class doesn't take an input stream.
https://commons.apache.org/proper/commons-compress/javadocs/api-1.16/org/apache/commons/compress/archivers/sevenz/SevenZFile.html
Zip file code
spark.sparkContext.binaryFiles("fileName").flatMap{case (name: String, content: PortableDataStream) =>
val zis = new ZipInputStream(content.open)
Stream.continually(zis.getNextEntry)
.takeWhile(_ != null)
.flatMap { _ =>
val br = new BufferedReader(new InputStreamReader(zis))
Stream.continually(br.readLine()).takeWhile(_ != null)
}}
I am trying similar code for the 7z files something like
spark.sparkContext.binaryFiles("filename").flatMap{case (name: String, content: PortableDataStream) =>
val zis = new SevenZFile(content.open)
Stream.continually(zis.getNextEntry)
.takeWhile(_ != null)
.flatMap { _ =>
val br = new BufferedReader(new InputStreamReader(zis))
Stream.continually(br.readLine()).takeWhile(_ != null)
}}
But SevenZFile doesn't accept an input stream. I'm looking for ideas.
If the file is on the local filesystem, the following solution works, but my file is in HDFS.
Local fileSystem Code
public static void decompress(String in, File destination) throws IOException {
    // try-with-resources closes the archive even if extraction fails
    try (SevenZFile sevenZFile = new SevenZFile(new File(in))) {
        SevenZArchiveEntry entry;
        while ((entry = sevenZFile.getNextEntry()) != null) {
            if (entry.isDirectory()) {
                continue;
            }
            File curfile = new File(destination, entry.getName());
            File parent = curfile.getParentFile();
            if (!parent.exists()) {
                parent.mkdirs();
            }
            try (FileOutputStream out = new FileOutputStream(curfile)) {
                byte[] buffer = new byte[8192];
                int n;
                // read() may return fewer bytes than requested, so loop until the entry is exhausted
                while ((n = sevenZFile.read(buffer, 0, buffer.length)) != -1) {
                    out.write(buffer, 0, n);
                }
            }
        }
    }
}
After all these years of Spark evolution, there should be an easy way to do it.
Instead of using the java.io.File-based approach, you could try the SeekableByteChannel method as shown in this alternative constructor.
You can use a SeekableInMemoryByteChannel to read a byte array. So as long as you can pick up the 7zip files from S3 or whatever and hand them off as byte arrays you should be alright.
With all of that said, Spark is really not well-suited for processing things like zip and 7zip files. I can tell you from personal experience I've seen it fail badly once the files are too large for Spark's executors to handle.
Something like Apache NiFi will work much better for expanding archives and processing them. FWIW, I'm currently handling a large data dump that has me frequently dealing with 50GB tarballs that have several million files in them, and NiFi handles them very gracefully.
I can go through ZipInputStream, but before starting the iteration I want to get a specific file that I need during the iteration. How can I do that?
ZipInputStream zin = new ZipInputStream(myInputStream)
while ((entry = zin.getNextEntry()) != null)
{
println entry.getName()
}
If the myInputStream you're working with comes from a real file on disk then you can simply use java.util.zip.ZipFile instead, which is backed by a RandomAccessFile and provides direct access to the zip entries by name. But if all you have is an InputStream (e.g. if you're processing the stream directly on receipt from a network socket or similar) then you'll have to do your own buffering.
You could copy the stream to a temporary file, then open that file using ZipFile, or if you know the maximum size of the data in advance (e.g. for an HTTP request that declares its Content-Length up front) you could use a BufferedInputStream to buffer it in memory until you've found the required entry.
BufferedInputStream bufIn = new BufferedInputStream(myInputStream);
bufIn.mark(contentLength);
ZipInputStream zipIn = new ZipInputStream(bufIn);
ZipEntry entry;
boolean foundSpecial = false;
while ((entry = zipIn.getNextEntry()) != null) {
    if("special.txt".equals(entry.getName())) {
        // do whatever you need with the special entry
        foundSpecial = true;
        break;
    }
}
if(foundSpecial) {
    // rewind
    bufIn.reset();
    zipIn = new ZipInputStream(bufIn);
    // ....
}
(I haven't tested this code myself, you may find it's necessary to use something like the commons-io CloseShieldInputStream in between the bufIn and the first zipIn, to allow the first zip stream to close without closing the underlying bufIn before you've rewound it).
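A variation on the same idea, as a rough sketch: if the archive is known to be small enough to hold in memory, copy the stream into a byte array once and open a fresh ZipInputStream over it for each pass. This sidesteps mark/reset and the question of when the first stream gets closed. The class name TwoPassZip and its method names are invented for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper class: buffers the archive once, then scans it as often as needed.
final class TwoPassZip {

    // Copies the whole stream into memory; only sensible when the archive is known to be small.
    static byte[] readFully(InputStream in) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buf.write(chunk, 0, n);
        }
        return buf.toByteArray();
    }

    // First pass: returns the content of the named entry, or null if it is absent.
    static byte[] findEntry(byte[] zipBytes, String name) throws IOException {
        try (ZipInputStream zis = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            ZipEntry entry;
            while ((entry = zis.getNextEntry()) != null) {
                if (name.equals(entry.getName())) {
                    // read() returns -1 at the end of the current entry, so this stops there
                    return readFully(zis);
                }
            }
        }
        return null;
    }

    // Second pass: a fresh stream over the same bytes starts again at the first entry.
    static List<String> listNames(byte[] zipBytes) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zis = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            ZipEntry entry;
            while ((entry = zis.getNextEntry()) != null) {
                names.add(entry.getName());
            }
        }
        return names;
    }
}
```

Typical use would be `byte[] zip = TwoPassZip.readFully(myInputStream);`, then `findEntry(zip, "special.txt")` before the main iteration.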
Use the getName() method on ZipEntry to find the file you want.
ZipInputStream zin = new ZipInputStream(myInputStream)
String myFileName = "foo.txt";
while ((entry = zin.getNextEntry()) != null)
{
    if (entry.getName().equals(myFileName)) {
        // process your file
        // stop looking for your file - you've already found it
        break;
    }
}
From Java 7 onwards, you are better off using ZipFile instead of ZipInputStream if you only want one file and you have a file to read from:
ZipFile zfile = new ZipFile(aFile);
String myFile = "foo.txt";
ZipEntry entry = zfile.getEntry(myFile);
if (entry != null) {
    // process your file
}
Look at Finding a file in zip entry
ZipFile file = new ZipFile("file.zip");
InputStream is = searchImage("foo.png", file);

public InputStream searchImage(String name, ZipFile file) throws IOException
{
    // entries() returns an Enumeration, so wrap it in a list to use for-each
    for (ZipEntry e : Collections.list(file.entries())){
        if (e.getName().endsWith(name)){
            return file.getInputStream(e);
        }
    }
    return null;
}
I'm late to the party, but none of the answers above really answers the question, and the accepted answer suggests creating a temp file, which is inefficient.
Let's create a sample zip file:
seq 10000 | sed "s/^.*$/a/"> /tmp/a
seq 10000 20000 | sed "s/^.*$/b/"> /tmp/b
seq 20000 30000 | sed "s/^.*$/c/"> /tmp/c
zip /tmp/out.zip /tmp/a /tmp/b /tmp/c
So now we have the file /tmp/out.zip, which contains 3 files, each of them full of the character a, b or c.
Now let's read it:
public static void main(String[] args) throws IOException {
ZipInputStream zipStream = new ZipInputStream(new FileInputStream("/tmp/out.zip"));
ZipEntry zipEntry;
while ((zipEntry = zipStream.getNextEntry()) != null) {
String name = zipEntry.getName();
System.out.println("Entry: "+name);
if (name.equals("tmp/c")) {
byte[] bytes = zipStream.readAllBytes();
String s = new String(bytes);
System.out.println(s);
}
}
}
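Calling readAllBytes in the middle of processing the stream looks weird, but it works: ZipInputStream reports end-of-stream at the end of the current entry, so readAllBytes returns only that entry's bytes rather than the rest of the archive. I also tested it on some images, where a failure would be more likely to show. So the API is unintuitive, but it behaves as intended. Note that readAllBytes requires Java 9 or later.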
I can't seem to import the packages needed or find any online examples of how to extract a .tar.gz file in java.
What makes it worse is I'm using JSP pages and am having trouble importing packages into my project. I'm copying the .jar's into WebContent/WEB-INF/lib/ and then right clicking on the project and selecting import external jar and importing it. Sometimes the packages resolve, other times they don't. Can't seem to get GZIP to import either. The imports in eclipse for jsp aren't intuitive like they are in normal Java code where you can right click a recognized package and select import.
I've tried the Apache commons library, the ice and another one called JTar. Ice has imported, but I can't find any examples of how to use it?
I guess I need to uncompress the gzipped part first, then open it with the tarstream?
Any help is greatly appreciated.
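The accepted answer works fine, but I think the intermediate write-to-file step is redundant.
You could use something like
TarArchiveInputStream tarInput =
    new TarArchiveInputStream(new GZIPInputStream(new FileInputStream("Your file name")));
TarArchiveEntry currentEntry;
// fetch the next entry on every iteration, otherwise the loop never ends
while ((currentEntry = tarInput.getNextTarEntry()) != null) {
    // outputDir is a placeholder for wherever you want to extract to
    File f = new File(outputDir, currentEntry.getName());
    // TODO write to file as usual, reading the entry's bytes from tarInput
}
tarInput.close();
Hope this helps.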
Maven Repo
OK, I finally figured this out; here is my code in case it helps anyone in the future.
It's written in Java, using the Apache Commons IO and Compress libraries.
File dir = new File("directory/of/.tar.gz/files/here");
File listDir[] = dir.listFiles();
if (listDir.length!=0){
for (File i:listDir){
/* Warning! this will try and extract all files in the directory
if other files exist, a for loop needs to go here to check that
the file (i) is an archive file before proceeding */
if (i.isDirectory()){
continue; // skip directories but keep processing the remaining files
}
String fileName = i.toString();
String tarFileName = fileName +".tar";
FileInputStream instream= new FileInputStream(fileName);
GZIPInputStream ginstream =new GZIPInputStream(instream);
FileOutputStream outstream = new FileOutputStream(tarFileName);
byte[] buf = new byte[1024];
int len;
while ((len = ginstream.read(buf)) > 0)
{
outstream.write(buf, 0, len);
}
ginstream.close();
outstream.close();
//There should now be tar files in the directory
//extract specific files from tar
TarArchiveInputStream myTarFile=new TarArchiveInputStream(new FileInputStream(tarFileName));
TarArchiveEntry entry = null;
int offset;
FileOutputStream outputFile=null;
//read every single entry in TAR file
while ((entry = myTarFile.getNextTarEntry()) != null) {
//the following two lines remove the .tar.gz extension for the folder name
String folderName = i.getName().substring(0, i.getName().lastIndexOf('.'));
folderName = folderName.substring(0, folderName.lastIndexOf('.'));
File outputDir = new File(i.getParent() + "/" + folderName + "/" + entry.getName());
if(! outputDir.getParentFile().exists()){
outputDir.getParentFile().mkdirs();
}
//if the entry in the tar is a directory, it needs to be created, only files can be extracted
if(entry.isDirectory()){
outputDir.mkdirs();
}else{
byte[] content = new byte[(int) entry.getSize()];
// a single read() may return fewer bytes than requested; readFully loops until the buffer is full
IOUtils.readFully(myTarFile, content);
outputFile=new FileOutputStream(outputDir);
IOUtils.write(content,outputFile);
outputFile.close();
}
}
//close and delete the tar files, leaving the original .tar.gz and the extracted folders
myTarFile.close();
File tarFile = new File(tarFileName);
tarFile.delete();
}
}
Say we have code like:
File file = new File("zip1.zip");
ZipInputStream zis = new ZipInputStream(new FileInputStream(file));
Let's assume you have a .zip file that contains the following:
zip1.zip
    hello.c
    world.java
    folder1
        foo.c
        bar.java
    foobar.c
How would zis.getNextEntry() iterate through that?
Would it return hello.c, world.java, folder1, foobar.c and completely ignore the files in folder1?
Or would it return hello.c, world.java, folder1, foo.c, bar.java, and then foobar.c?
Would it even return folder1 since it's technically a folder and not a file?
Thanks!
Well... let's see:
ZipInputStream zis = new ZipInputStream(new FileInputStream("C:\\New Folder.zip"));
try
{
    ZipEntry temp = null;
    while ( (temp = zis.getNextEntry()) != null )
    {
        System.out.println( temp.getName());
    }
}
finally
{
    zis.close();
}
Output:
New Folder/
New Folder/folder1/
New Folder/folder1/bar.java
New Folder/folder1/foo.c
New Folder/foobar.c
New Folder/hello.c
New Folder/world.java
Yes. It will print the folder name too, since it's also an entry within the zip. It will also print the entries in the same order as they are stored inside the zip. You can use the test below to verify your output.
public class TestZipOrder {
@Test
public void testZipOrder() throws Exception {
File file = new File("/Project/test.zip");
ZipInputStream zis = new ZipInputStream(new FileInputStream(file));
ZipEntry entry = null;
while ( (entry = zis.getNextEntry()) != null ) {
System.out.println( entry.getName());
}
}
}
Excerpt from: https://blogs.oracle.com/CoreJavaTechTips/entry/creating_zip_and_jar_files
java.util.zip libraries offer some level of control for the added entries of the ZipOutputStream.
First, the order you add entries to the ZipOutputStream is the order they are physically located in the .zip file.
You can manipulate the enumeration of entries returned back by the entries() method of ZipFile to produce a list in alphabetical or size order, but the entries are still stored in the order they were written to the output stream.
So I would believe that you have to use the entries() method to see the order in which it will be iterated through.
ZipFile zf = new ZipFile("your file path with file name");
for (Enumeration<? extends ZipEntry> e = zf.entries();
e.hasMoreElements();) {
System.out.println(e.nextElement().getName());
}
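The zip file internal directory is a "flat" list of all the files and directories in the zip. ZipInputStream does not consult that directory, though: getNextEntry reads the entry headers sequentially, in the order they are physically stored, and identifies every file and directory entry in the zip file along the way.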
There is a variant of the zip file format that has no central directory, in which case (if it's handled at all) I suspect you'd iterate through all actual files in the zip, skipping directories (but not skipping files in directories).
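To see the ordering for yourself without a test archive on disk, a small sketch along these lines may help. It builds a zip in memory with the entries from the question, including an explicit folder1/ entry, and reads the names back with getNextEntry. The class name ZipOrderDemo and its methods are made up for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical demo class, not part of any library.
final class ZipOrderDemo {

    // Writes the given entry names in order; names ending in "/" become directory entries.
    static byte[] buildZip(List<String> names) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(bytes)) {
            for (String name : names) {
                zos.putNextEntry(new ZipEntry(name));
                if (!name.endsWith("/")) {
                    zos.write(("content of " + name).getBytes(StandardCharsets.UTF_8));
                }
                zos.closeEntry();
            }
        }
        return bytes.toByteArray();
    }

    // Returns the entry names in the order getNextEntry() yields them.
    static List<String> streamOrder(byte[] zip) throws IOException {
        List<String> seen = new ArrayList<>();
        try (ZipInputStream zis = new ZipInputStream(new ByteArrayInputStream(zip))) {
            ZipEntry entry;
            while ((entry = zis.getNextEntry()) != null) {
                seen.add(entry.getName());
            }
        }
        return seen;
    }

    public static void main(String[] args) throws IOException {
        List<String> names = Arrays.asList(
                "hello.c", "world.java", "folder1/", "folder1/foo.c", "folder1/bar.java", "foobar.c");
        for (String name : streamOrder(buildZip(names))) {
            System.out.println(name);
        }
    }
}
```

Note that a directory only shows up if the tool that created the archive stored an entry for it; files inside it are listed with their full path either way.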