Programmatically Extract Single Specific File From 7zip Archive - Java - Linux


I would really appreciate your input on the below scenario please.
The requirements:
- I have a 7zip archive file with several thousands of files in it
- I have a Java application running on Linux that is required to retrieve individual files from the 7zip archive
I would like to retrieve a file from the archive by its path (e.g. my7zFile.7z/file1.pdf) without having to iterate through all the files in the archive and compare file names.
I would like to avoid having to extract all files from the archive before running the search (the uncompressed archive is several TB).
I had a look at the 7zip Java binding, specifically the IInArchive interface; the only extract method seems to work via file index, not via file name:
http://sevenzipjbind.sourceforge.net/javadoc/net/sf/sevenzipjbinding/IInArchive.html
Do you know of any other libraries that could help me with this use case or am I overlooking a way of doing this with 7zip jbinding?
Thank you
Kind regards,
Tobi

Sadly it appears the API doesn't provide enough to fulfill all your requirements. In order to extract a single file it appears you need to walk the archive index. The simplified interface to the archive makes this much easier:
The ISimpleInArchive interface provides:
ISimpleInArchiveItem[] getArchiveItems()
This allows you to retrieve a list of the items in the archive.
The ISimpleInArchiveItem interface provides the method:
java.lang.String getPath()
Hence you can walk the archive items, comparing on path. Granted, this is against your requirements.
However, note that this walks the index table and does not extract the files until requested. Once you have the item you're after, you can use:
ExtractOperationResult extractSlow(ISequentialOutStream SequentialOutStream)
on the item you have found to actually extract it.
Looking at the 7z file format (note this is not the official site of 7zip), the header information is all at the end of the file with the Signature header at the start of the file giving an offset to the start of the header info. So provided the SevenZip bindings are written nicely, your search will at most read the start of the file (SignatureHeader) to find the offset to the HeaderInfo section, then walk the HeaderInfo section in order to build up the file list required in getArchiveItems(). Only once you have the item you need will it shift back to the index of the actual stream for the file you want extracted (most likely when you call extractSlow).
So whilst not all your requirements are met, the overhead of the search/compare required is limited to only searching the header info of the archive.
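If many lookups are made against the same archive, the walk over the header only has to happen once. The sketch below is a minimal illustration using nothing but the standard library: ArchivePathIndex is a hypothetical helper (not part of the 7zip binding) that caches a path-to-index map. It assumes the list of paths was collected by calling getPath() on each item returned by getArchiveItems(), and that an item's position in that array matches the index expected by the index-based extract method; that assumption should be checked against the binding's documentation.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: caches "path inside archive" -> "item index". */
final class ArchivePathIndex {
    private final Map<String, Integer> indexByPath = new HashMap<>();

    /** paths.get(i) is assumed to be the path of archive item i. */
    ArchivePathIndex(List<String> paths) {
        for (int i = 0; i < paths.size(); i++) {
            // putIfAbsent keeps the first index if the same path occurs twice
            indexByPath.putIfAbsent(paths.get(i), i);
        }
    }

    /** Returns the item index for the path, or -1 if there is no such entry. */
    int indexOf(String path) {
        return indexByPath.getOrDefault(path, -1);
    }

    public static void main(String[] args) {
        ArchivePathIndex index =
                new ArchivePathIndex(List.of("file1.pdf", "docs/file2.pdf"));
        System.out.println(index.indexOf("docs/file2.pdf"));
        System.out.println(index.indexOf("missing.pdf"));
    }
}
```

Building the map costs one pass over the header information; each later lookup is a hash lookup, and only the selected item is actually decompressed.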

I once wrote some code to read all the files and folders in a zip file. The zip file contained a deep hierarchy of text files and folders. I am not sure whether it will help you, but here is the skeleton of the code.
import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.util.Enumeration;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
ZipFile zipFile = new ZipFile(filepath); // filepath of the zip file
Enumeration<? extends ZipEntry> entries = zipFile.entries();
while (entries.hasMoreElements()) {
    ZipEntry entry = entries.nextElement();
    if (entry.isDirectory()) { // found directory inside the zipFile
        // write your code here
    } else {
        InputStream stream = zipFile.getInputStream(entry);
        BufferedReader reader = new BufferedReader(new InputStreamReader(stream));
        // write your code to read the content of the file
    }
}
You can modify the code to reach the file you want in the zip. If you already know the entry's full path, ZipFile.getEntry(name) returns it directly, so walking all the entries is only needed when you have to search. Note that entries() returns the entries in the order in which they are stored in the archive's central directory, which is not necessarily a depth-first traversal of the folder hierarchy. You will find detailed examples on the web.
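For a plain zip archive (not 7z), a single entry can be read by name with ZipFile.getEntry, without looping over the entries. The sketch below assumes Java 9 or later for readAllBytes; the class name ZipEntryLookup, the entry names, and the temporary sample archive are made up for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipOutputStream;

final class ZipEntryLookup {

    /** Reads one entry by its exact name; returns null if there is no such entry. */
    static byte[] readEntry(Path zipPath, String entryName) throws IOException {
        try (ZipFile zipFile = new ZipFile(zipPath.toFile())) {
            ZipEntry entry = zipFile.getEntry(entryName);
            if (entry == null) {
                return null;
            }
            try (InputStream in = zipFile.getInputStream(entry)) {
                return in.readAllBytes();
            }
        }
    }

    /** Writes a tiny two-entry archive so the example is self-contained. */
    static Path createSampleZip() throws IOException {
        Path zipPath = Files.createTempFile("sample", ".zip");
        try (ZipOutputStream out = new ZipOutputStream(Files.newOutputStream(zipPath))) {
            out.putNextEntry(new ZipEntry("docs/a.txt"));
            out.write("alpha".getBytes(StandardCharsets.UTF_8));
            out.closeEntry();
            out.putNextEntry(new ZipEntry("docs/b.txt"));
            out.write("beta".getBytes(StandardCharsets.UTF_8));
            out.closeEntry();
        }
        return zipPath;
    }

    public static void main(String[] args) throws IOException {
        Path zipPath = createSampleZip();
        byte[] data = readEntry(zipPath, "docs/b.txt");
        System.out.println(new String(data, StandardCharsets.UTF_8));
        Files.delete(zipPath);
    }
}
```

Entry names use forward slashes and must match exactly, including any leading directory components.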

Related

Processing/Java File Count Issue With File Pathway (Variable Type)

Although the title isn't very clear, I have a simple issue. I'm trying to write some code in a Processing sketch (https://processing.org/) that counts how many files are in a folder. The problem is that it doesn't accept the variable type.
File folder = File("My File Path");
folder.listFiles().size;
It says the function File(String) doesn't exist. When I try to put the file path without quotation marks, it still doesn't work!
If you have a solution then please use a functioning example so that I know how it works. Thanks for any help!

As Joakim Danielson says, File is a constructor, so you need to use the new keyword.
The code below will work for you.
File folder = new File("My File Path");
int fileLength = folder.listFiles().length;

It's a constructor so you need to use new
File folder = new File("My File Path");
//To get the number of files in the folder
folder.listFiles().length;
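One caveat with the snippets above: File.listFiles() returns null when the path does not exist or is not a directory, so calling .length on the result can throw a NullPointerException. A minimal sketch of a guarded count in plain Java follows; FolderFileCount and countEntries are hypothetical names, and in a Processing sketch the method body could be used as an ordinary function.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

final class FolderFileCount {

    /**
     * Returns the number of entries (files and directories) directly inside
     * the folder, or -1 if the folder cannot be listed.
     */
    static int countEntries(File folder) {
        File[] children = folder.listFiles(); // null if not a directory or on I/O error
        return children == null ? -1 : children.length;
    }

    public static void main(String[] args) throws IOException {
        // Build a temporary folder with two files so the example is self-contained
        Path dir = Files.createTempDirectory("count-demo");
        Files.createFile(dir.resolve("a.txt"));
        Files.createFile(dir.resolve("b.txt"));
        System.out.println(countEntries(dir.toFile()));
        System.out.println(countEntries(new File(dir.toFile(), "does-not-exist")));
    }
}
```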

Assuming the "My File Path" folder is inside your sketch you need to provide the path to your sketch. Luckily Processing already provides a helper function: sketchPath()
Here's an example:
File folder = new File(sketchPath("My File Path"));
println("folder.exists: " + folder.exists());
if(folder.exists()){
println(folder.listFiles().length + " files and/or directories");
}else{
println("folder does not exist, double check the path");
}
Bear in mind there's also a dataPath() function, which points to a folder named data in your sketch folder. The data folder is typically used for storing external data, e.g. assets (raster or vector images, Processing font files) or raw data (binary, text, csv, xml, json, etc.). This is useful for separating your sketch source files from the data to be loaded or accessed by your sketch.
Also, Processing has a few utility functions for listing files and folders.
Be sure to check out Processing > Examples > Topics > File IO > DirectoryList
The example includes less documented functions such as listFiles() (which returns an array of java.io.File objects based on the filters set) or listPaths() (which returns an array of String objects: just the paths).
The options and filters are quite handy. For example, if you want to list directories only and ignore files, you can simply write:
println("directories: " + listFiles(sketchPath("My File Path"),"directories").length);
For example, if you want to list all the wav files in a data/audio directory inside the sketch, you can use:
File[] files = listFiles(dataPath("audio"), "files", "extension=wav");
This will ignore directories and any other file that does not have .wav extension.
To make this answer complete, here are a few more details on the options for listFiles/listPaths from Processing's source code:
"relative" -> no effect with the Files version, but important for listPaths
"recursive"-> traverse nested directories
"extension=js" or "extensions=js|csv|txt" (no dot)
"directories" -> only directories
"files" -> only files
"hidden" -> include hidden files (prefixed with .) disabled by default

Reading a file from tar.gz archive in Spark

I have a bunch of tar.gz files which I would like to process with Spark without decompressing them.
A single archive is about 700MB and contains 10 different files, but I'm interested in only one of them (which is ~7GB after decompression).
I know that context.textFile supports tar.gz, but I'm not sure whether it is the right tool when an archive contains more than one file. What happens is that Spark returns the content of all files in the archive (line by line), including file names with some binary data.
Is there any way to select which file from tar.gz I would like to map?

As far as I know, the sc.binaryFiles method is what you want; see the documentation below. It gives you both the file name and the file content, so you can map over the result, pick the file you want, and process it.
public RDD<scala.Tuple2<String,PortableDataStream>> binaryFiles(String path, int minPartitions)
Get an RDD for a Hadoop-readable dataset as PortableDataStream for each file (useful for binary data)
For example, if you have the following files:
hdfs://a-hdfs-path/part-00000
hdfs://a-hdfs-path/part-00001
...
hdfs://a-hdfs-path/part-nnnnn
Do val rdd = sparkContext.binaryFiles("hdfs://a-hdfs-path"),
then rdd contains
(a-hdfs-path/part-00000, its content)
(a-hdfs-path/part-00001, its content)
...
(a-hdfs-path/part-nnnnn, its content)

Read tgz w/out unpacking it onto computer or Unpack as temp & delete when program closes?

Hey guys, I'm currently using jarchivelib. I'm stuck on figuring out a way to read the files without unpacking the archive, because extracting writes the unpacked files to disk. Example:
File archive = new File("/home/jack/archive.zip");
File destination = new File("/home/jack/archive");
Archiver archiver = ArchiverFactory.createArchiver(ArchiveFormat.ZIP);
archiver.extract(archive, destination);
I want to make it so I don't have to unpack the archive to read the files. If there is no way to do that, I'm guessing I'll have to write a custom close handler instead of relying on JFrame.setDefaultCloseOperation, so that it deletes the files. Or is there a better way of handling temp files?

If all you want to do is read the files, why not use Java's built-in zip support, or Zip4j if the archive is password protected? These libraries support streams, so you can read the contents of each file without writing it to disk.
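As a rough sketch of the stream-based approach with the built-in java.util.zip classes: the code below reads every file of a ZIP archive into memory and never touches the filesystem. It assumes Java 9 or later for readAllBytes and covers ZIP archives only, not tgz; InMemoryZipReader and the sample entry names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

final class InMemoryZipReader {

    /** Reads every regular file in the archive into memory, keyed by entry name. */
    static Map<String, byte[]> readAll(InputStream archive) throws IOException {
        Map<String, byte[]> contents = new LinkedHashMap<>();
        try (ZipInputStream zin = new ZipInputStream(archive)) {
            ZipEntry entry;
            while ((entry = zin.getNextEntry()) != null) {
                if (!entry.isDirectory()) {
                    // readAllBytes() stops at the end of the current entry
                    contents.put(entry.getName(), zin.readAllBytes());
                }
            }
        }
        return contents;
    }

    /** Builds a small archive in memory so the example needs no files. */
    static byte[] sampleArchive() throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ZipOutputStream out = new ZipOutputStream(buffer)) {
            out.putNextEntry(new ZipEntry("notes/"));
            out.closeEntry();
            out.putNextEntry(new ZipEntry("notes/todo.txt"));
            out.write("buy milk".getBytes(StandardCharsets.UTF_8));
            out.closeEntry();
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        Map<String, byte[]> files = readAll(new ByteArrayInputStream(sampleArchive()));
        files.forEach((name, data) ->
                System.out.println(name + ": " + new String(data, StandardCharsets.UTF_8)));
    }
}
```

For large entries it would be better to process the stream incrementally instead of collecting whole byte arrays.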

As of version 0.4.0, the jarchivelib Archiver API supports streaming an archive rather than extracting it directly onto the filesystem.
ArchiveStream stream = archiver.stream(archive);
ArchiveEntry entry;
while((entry = stream.getNextEntry()) != null) {
// access each archive entry individually using the stream
// or extract it using entry.extract(destination)
// or fetch meta-data using entry.getName(), entry.isDirectory(), ...
}
stream.close();
When the stream is pointing to an entry after a call to getNextEntry, you can use the stream.read methods just as you would when reading an individual entry.

List .zip directories without extracting

I am building a file explorer in Java and I am listing the files/folders in JTrees. What I am trying to do now is when I get to a zipped folder I want to list its contents, but without extracting it first.
If anyone has an idea, please share.

I suggest you have a look at ZipFile.entries().
Here's some code:
try (ZipFile zipFile = new ZipFile("test.zip")) {
Enumeration<? extends ZipEntry> zipEntries = zipFile.entries();
while (zipEntries.hasMoreElements()) {
String fileName = zipEntries.nextElement().getName();
System.out.println(fileName);
}
}
If you're using Java 8, you can avoid the legacy Enumeration interface by using ZipFile::stream as follows:
zipFile.stream()
.map(ZipEntry::getName)
.forEach(System.out::println);
If you need to know whether an entry is a directory or not, you can use ZipEntry.isDirectory. You can't get much more information than that without extracting the file (for obvious reasons).
If you want to avoid extracting all files, you can extract one file at a time using ZipFile.getInputStream for each ZipEntry. (Note that you don't need to store the unpacked data on disk; you can just read the input stream and discard the bytes as you go.)

Use java.util.zip.ZipFile class and, specifically, its entries method.
You'll have something like this:
ZipFile zipFile = new ZipFile("testfile.zip");
Enumeration<? extends ZipEntry> zipEntries = zipFile.entries();
String fname;
while (zipEntries.hasMoreElements()) {
    fname = zipEntries.nextElement().getName();
    ...
}

For handling ZIP files you can use the ZipFile class. Its entries() method returns an enumeration of the entries contained in the ZIP file. This information is stored in the ZIP central directory, so no extraction is required.
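For a file explorer tree, it is usually enough to list the immediate children of one directory at a time when a node is expanded. The following is a minimal sketch under that assumption: ZipTreeLister and childrenOf are hypothetical names, and child directories are derived from the entry names because many archives do not store explicit directory entries.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Set;
import java.util.TreeSet;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipOutputStream;

final class ZipTreeLister {

    /**
     * Returns the immediate children of a directory inside the archive.
     * dirPrefix is "" for the root or e.g. "docs/" for a subdirectory;
     * child directories are returned with a trailing slash.
     */
    static Set<String> childrenOf(Path zipPath, String dirPrefix) throws IOException {
        Set<String> children = new TreeSet<>();
        try (ZipFile zipFile = new ZipFile(zipPath.toFile())) {
            zipFile.stream()
                   .map(ZipEntry::getName)
                   .filter(name -> name.startsWith(dirPrefix)
                           && name.length() > dirPrefix.length())
                   .forEach(name -> {
                       String rest = name.substring(dirPrefix.length());
                       int slash = rest.indexOf('/');
                       // keep only the first path segment below dirPrefix
                       children.add(slash < 0 ? rest : rest.substring(0, slash + 1));
                   });
        }
        return children;
    }

    /** Writes a small sample archive so the example is self-contained. */
    static Path createSampleZip() throws IOException {
        Path zipPath = Files.createTempFile("tree", ".zip");
        try (ZipOutputStream out = new ZipOutputStream(Files.newOutputStream(zipPath))) {
            for (String name : new String[] {"readme.txt", "docs/a.txt", "docs/img/logo.png"}) {
                out.putNextEntry(new ZipEntry(name));
                out.closeEntry();
            }
        }
        return zipPath;
    }

    public static void main(String[] args) throws IOException {
        Path zipPath = createSampleZip();
        System.out.println(childrenOf(zipPath, ""));
        System.out.println(childrenOf(zipPath, "docs/"));
        Files.delete(zipPath);
    }
}
```

Each call reopens the archive and scans the entry names; for very large archives the names could be read once and cached.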

How to extract a single file from a remote archive file?

Given
URL of an archive (e.g. a zip file)
Full name (including path) of a file inside that archive
I'm looking for a way (preferably in Java) to create a local copy of that file, without downloading the entire archive first.
From my (limited) understanding it should be possible, though I have no idea how to do that. I've been using TrueZip, since it seems to support a large variety of archive types, but I have doubts about its ability to work in such a way. Does anyone have any experience with that sort of thing?
EDIT: being able to also do that with tarballs and zipped tarballs is also important for me.

Well, at a minimum, you have to download the portion of the archive up to and including the compressed data of the file you want to extract. That suggests the following solution: open a URLConnection to the archive, get its input stream, wrap it in a ZipInputStream, and repeatedly call getNextEntry() and closeEntry() to iterate through all the entries in the file until you reach the one you want. Then you can read its data using ZipInputStream.read(...).
The Java code would look something like this:
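import java.io.FileNotFoundException;
import java.net.URL;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
URL url = new URL("http://example.com/path/to/archive");
try (ZipInputStream zin = new ZipInputStream(url.openStream())) {
    ZipEntry ze = zin.getNextEntry();
    while (ze != null && !ze.getName().equals(pathToFile)) {
        ze = zin.getNextEntry(); // getNextEntry() closes the previous entry itself
    }
    if (ze == null) {
        throw new FileNotFoundException(pathToFile + " not found in archive");
    }
    // ze.getSize() may be -1 when streaming, so read to the end of the entry instead
    byte[] bytes = zin.readAllBytes(); // Java 9+
}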
This is, of course, untested.

Contrary to the other answers here, I'd like to point out that ZIP entries are compressed individually, so (in theory) you don't need to download anything more than the directory and the entry itself. The server would need to support the Range HTTP header for this to work.
The standard Java API only supports reading ZIP files from local files and input streams. As far as I know there's no provision for reading from random access remote files.
Since you're using TrueZip, I recommend implementing de.schlichtherle.io.rof.ReadOnlyFile using Apache HTTP Client and creating a de.schlichtherle.util.zip.ZipFile with that.
This won't provide any advantage for compressed TAR archives since the entire archive is compressed together (beyond just using an InputStream and killing it when you have your entry).
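To make the ZIP case more concrete: a client could fetch only the last part of the file with a suffix range request (for example Range: bytes=-65557, which covers the largest possible End of Central Directory record), parse that record to find where the central directory lives, and then request just the directory and the one entry it needs. The sketch below shows only the parsing step, on a byte array, using the standard library; ZipDirectoryLocator is a hypothetical name, the HTTP part is left out, and ZIP64 archives are not handled.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

final class ZipDirectoryLocator {
    private static final int EOCD_SIGNATURE = 0x06054b50;
    private static final int EOCD_MIN_SIZE = 22;

    /**
     * Scans the tail of a ZIP file for the End of Central Directory record.
     * Returns {entryCount, directorySize, directoryOffset}, or null if the
     * record is not found. ZIP64 archives are not handled.
     */
    static long[] locateCentralDirectory(byte[] tail) {
        ByteBuffer buf = ByteBuffer.wrap(tail).order(ByteOrder.LITTLE_ENDIAN);
        // Search backwards because an optional comment may follow the record
        for (int pos = tail.length - EOCD_MIN_SIZE; pos >= 0; pos--) {
            if (buf.getInt(pos) == EOCD_SIGNATURE) {
                long entryCount = buf.getShort(pos + 10) & 0xFFFF;
                long directorySize = buf.getInt(pos + 12) & 0xFFFFFFFFL;
                long directoryOffset = buf.getInt(pos + 16) & 0xFFFFFFFFL;
                return new long[] {entryCount, directorySize, directoryOffset};
            }
        }
        return null;
    }

    /** Builds a two-entry archive in memory so the example is self-contained. */
    static byte[] sampleArchive() throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ZipOutputStream out = new ZipOutputStream(buffer)) {
            out.putNextEntry(new ZipEntry("a.txt"));
            out.write(new byte[] {1, 2, 3});
            out.closeEntry();
            out.putNextEntry(new ZipEntry("b.txt"));
            out.write(new byte[] {4, 5, 6});
            out.closeEntry();
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        long[] info = locateCentralDirectory(sampleArchive());
        System.out.println("entries=" + info[0] + " size=" + info[1] + " offset=" + info[2]);
    }
}
```

The directory offset is relative to the start of the archive, so it can be used directly as the start of the next range request.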

Since TrueZIP 7.2, there is a new client API in the module TrueZIP Path. This is an implementation of an NIO.2 FileSystemProvider for JSE 7. Using this API, you can access HTTP URI as follows:
Path path = new TPath(new URI("http://acme.com/download/everything.tar.gz/README.TXT"));
try (InputStream in = Files.newInputStream(path)) {
// Read archive entry contents here.
...
}

I'm not sure if there's a way to pull out a single file from a ZIP without downloading the whole thing first. But, if you're the one hosting the ZIP file, you could create a Java servlet which reads the ZIP file and returns the requested file in the response:
public class GetFileFromZIPServlet extends HttpServlet{
@Override
public void doGet(HttpServletRequest request, HttpServletResponse response)
throws ServletException, IOException{
String pathToFile = request.getParameter("pathToFile");
byte fileBytes[];
//get the bytes of the file from the ZIP
//set the appropriate content type, maybe based on the file extension
response.setContentType("...");
//write file to the response
response.getOutputStream().write(fileBytes);
}
}
