Java library to extract files from and compress files into archives

I am looking for a Java library that extracts files from archives and compresses files into archives.
My requirements are:
(1) It should handle the widest possible range of archive types.
(2) It should be simple, requiring only a few lines of code (unlike Java's built-in zip classes).
(3) It should be production-grade, without defects.

Apache Commons Compress seems to be a good fit: it can deal with quite a lot of different compression types and is very straightforward to use.
http://commons.apache.org/compress/index.html

Related

Efficient LZ4 multiple-file compression using Java

I took Adrien Grand's Java repository, which provides JNI bindings to the original native LZ4 code.
I want to compress multiple files under a given input directory, but LZ4 doesn't support multiple-file compression the way the Java zip package does. So I tried another approach: tar all my input files and pipe the result into the LZ4 compressor. I used the JTar Java package to tar the input files. Is there a better way than this?
I came across many code samples that compress strings and show how to correctly implement the LZ4 compressor and decompressor. Now I want to know how to actually implement it for multiple files, and whether I'm going in the right direction.
After tarring all the files, according to the sample code's usage notes I have to convert the tar file to a byte array to hand it to the compressor. I used Apache Commons IO (IOUtils) for this. With many input files the tar becomes huge, and always converting it to a byte array seems inefficient to me. Is it inefficient, and is there a better way of using the LZ4 package?
Another problem is the end result. After compressing the tarred files I get an output file such as MyResult.lz4, but I could not decompress it with the archive manager (I'm on Ubuntu) because it doesn't support this format. I'm also not clear about which archive and compression format I should use here, or what format the end result should be in.
Speaking from a user's point of view: if I generate a backup and hand the user a traditional .zip, .gz or other well-known format, they can decompress it themselves. Just because I know LZ4 doesn't mean I can expect the user to know the format too, right? They may even be baffled by it. Converting from .lz4 to .zip afterwards also seems pointless.
I already see tarring all my input files as a time-consuming step, so I'd like to know how much it affects performance. With the Java zip package, compressing multiple input files didn't seem to be a problem at all. Besides LZ4 I came across Apache Commons Compress and TrueZIP, and several Stack Overflow links about them that helped me learn a lot. For now I really want to use LZ4, especially because of its performance, but I ran into these hurdles. Can anyone with good knowledge of the LZ4 package answer these questions and provide a simple implementation? Thanks.
Timings I measured for an input consisting of many files:
Time taken for tarring: 4704 ms
Time taken for converting the file to a byte array: 7 ms
Time taken for compression: 33 ms
Some facts:
LZ4 is no different here than GZIP: it is a single-concern project, dealing with compression. It does not deal with archive structure. This is intentional.
Adrien Grand's LZ4 lib produces output incompatible with the command-line LZ4 utility. This is also intentional.
Your approach with tar seems OK because that's how it's done with GZIP.
Ideally you should make the tar code produce a stream which is immediately compressed instead of first being entirely stored in RAM. This is what is achieved at the command line using Unix pipes.
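A minimal sketch of that streaming idea, using only the JDK. GZIPOutputStream stands in here for an LZ4 stream, purely so the example has no external dependencies; the class name StreamingCompress and the 8 KiB buffer size are assumptions made for illustration. The point is that bytes flow through a small fixed buffer, so memory use does not grow with the input size.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper: GZIP is used as a stand-in for LZ4.
final class StreamingCompress {

    // Reads 'in' and writes compressed bytes to 'out'. Closes 'out' when done.
    static void compress(InputStream in, OutputStream out) {
        try (GZIPOutputStream gzip = new GZIPOutputStream(out)) {
            copy(in, gzip);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads compressed bytes from 'in' and writes the original bytes to 'out'.
    static void decompress(InputStream in, OutputStream out) {
        try (GZIPInputStream gzip = new GZIPInputStream(in)) {
            copy(gzip, out);
            out.flush();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Moves data in 8 KiB chunks; the whole input is never held in memory.
    private static void copy(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[8192];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
    }
}
```

For files, the streams would come from something like Files.newInputStream and Files.newOutputStream. In the tar case, the tar writer would write straight into the compressing stream instead of into an intermediate file or byte array.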
I had the same problem. The current release of LZ4 for Java is incompatible with the later-developed LZ4 standard for handling streams. However, the project's repo has a patch that supports the standard for compressing and decompressing streams, and I can confirm it is compatible with the command-line tool. You can find it here: https://github.com/jpountz/lz4-java/pull/61
In Java you can use that together with TarArchiveInputStream from Apache Commons Compress.
If you want an example, the code I use is in the Maven artifact io.github.htools 0.27-SNAPSHOT (or on GitHub). The classes io.github.htools.io.compressed.TarLz4FileWriter and (the obsolete class) io.github.htools.io.compressed.TarLz4File show how it works. In HTools, tar and lz4 are used automatically through ArchiveFile.getReader(String filename) and ArchiveFileWriter(String filename, int compressionlevel), provided your filename ends with .tar.lz4
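You can chain I/O streams together, for example TarArchiveOutputStream from Apache Commons Compress and LZ4FrameOutputStream from lz4-java:
import java.io.FileOutputStream;
import net.jpountz.lz4.LZ4FrameOutputStream;
import org.apache.commons.compress.archivers.tar.TarArchiveOutputStream;

try (LZ4FrameOutputStream outputStream = new LZ4FrameOutputStream(new FileOutputStream("path/to/myfile.tar.lz4"));
     TarArchiveOutputStream taos = new TarArchiveOutputStream(outputStream)) {
    ...
}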
Consolidating the bytes into a byte array creates a bottleneck, because you are then holding the entire stream in memory, which can easily lead to an OutOfMemoryError with large streams. Instead, you'll want to pipeline the bytes through all the I/O streams as above.
I created a Java library that does this for you https://github.com/spoorn/tar-lz4-java.
If you want to implement it yourself, here's a technical doc that includes details on how to LZ4 compress a directory using TarArchive from Apache Commons and lz4-java: https://github.com/spoorn/tar-lz4-java/blob/main/SUMMARY.md#lz4

How to create a directory in memory? pseudo file system / virtual directory

For my use case, I would like to have an in-memory directory to store some files for a very short time. I compile source code to .class files at runtime, then classload and execute them. The clean way would be to create a virtual directory and let the compiler create the .class files there. Of course I could use a temp directory, but I would have to clean it before compiling, I don't know if I'm the only one using it, etc.
So, is it possible to create a virtual (that is, in-memory) directory in Java, and if so, how?
In Java 6 it is not really possible to do this kind of thing within a Java application. You need to rely on the OS platform to provide the pseudo-filesystem.
In Java 7, the NIO APIs have been extended to provide an API that allows you to define new file systems; see FileSystemProvider.
Apache Commons VFS is another option, but it has a couple of characteristics that may cause problems for existing code and (3rd-party) libraries:
Files and directories in VFS are named using urls, not File objects. So code that uses File for file manipulation won't work.
FileInputStream, FileOutputStream, FileReader and FileWriter won't work with VFS for much the same reason.
It sounds like you could use a ramdisk. There are many apps out there that will do this; which one you use would depend on the target OS. I don't know of any native Java API that supports this.
I am not sure if this is helpful or not, but do check Apache Commons VFS.
It seems that what you need is a memory filesystem.
For Java 7's NIO there are:
https://github.com/marschall/memoryfilesystem and
http://exitcondition.alrubinger.com/2012/08/17/shrinkwrap-nio2/
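For the specific use case in the question (compile at runtime, then classload), there is also a route that needs no directory at all. It is not one of the approaches proposed in the answers above. The javax.tools API lets a custom file manager hand the compiler byte buffers in place of files. The following is a minimal sketch, assuming it runs on a JDK so that ToolProvider.getSystemJavaCompiler() is available; the names InMemoryCompiler, StringSource, ClassBytes and instantiate are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStream;
import java.net.URI;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import javax.tools.FileObject;
import javax.tools.ForwardingJavaFileManager;
import javax.tools.JavaCompiler;
import javax.tools.JavaFileManager;
import javax.tools.JavaFileObject;
import javax.tools.SimpleJavaFileObject;
import javax.tools.StandardJavaFileManager;
import javax.tools.ToolProvider;

// Hypothetical helper: compiles one class from a String and loads it without touching the disk.
final class InMemoryCompiler {

    // Source "file" backed by a String.
    static final class StringSource extends SimpleJavaFileObject {
        private final String code;

        StringSource(String className, String code) {
            super(URI.create("string:///" + className.replace('.', '/') + ".java"), Kind.SOURCE);
            this.code = code;
        }

        @Override
        public CharSequence getCharContent(boolean ignoreEncodingErrors) {
            return code;
        }
    }

    // Class "file" backed by a byte buffer.
    static final class ClassBytes extends SimpleJavaFileObject {
        private final ByteArrayOutputStream bytes = new ByteArrayOutputStream();

        ClassBytes(String className) {
            super(URI.create("mem:///" + className.replace('.', '/') + ".class"), Kind.CLASS);
        }

        @Override
        public OutputStream openOutputStream() {
            return bytes;
        }
    }

    // Compiles 'source', loads 'className' and returns a new instance cast to 'type'.
    static <T> T instantiate(String className, String source, Class<T> type) {
        JavaCompiler compiler = ToolProvider.getSystemJavaCompiler();
        if (compiler == null) {
            throw new IllegalStateException("No system compiler available (a JDK is required)");
        }
        Map<String, ClassBytes> compiled = new HashMap<>();
        StandardJavaFileManager standard = compiler.getStandardFileManager(null, null, null);
        // Redirect every class file the compiler wants to write into the map above.
        JavaFileManager manager = new ForwardingJavaFileManager<StandardJavaFileManager>(standard) {
            @Override
            public JavaFileObject getJavaFileForOutput(JavaFileManager.Location location, String name,
                    JavaFileObject.Kind kind, FileObject sibling) {
                ClassBytes out = new ClassBytes(name);
                compiled.put(name, out);
                return out;
            }
        };
        Boolean ok = compiler.getTask(null, manager, null, null, null,
                Collections.singletonList(new StringSource(className, source))).call();
        if (!Boolean.TRUE.equals(ok)) {
            throw new IllegalArgumentException("Compilation failed for " + className);
        }
        // Class loader that defines classes from the captured bytes.
        ClassLoader loader = new ClassLoader(InMemoryCompiler.class.getClassLoader()) {
            @Override
            protected Class<?> findClass(String name) throws ClassNotFoundException {
                ClassBytes out = compiled.get(name);
                if (out == null) {
                    throw new ClassNotFoundException(name);
                }
                byte[] b = out.bytes.toByteArray();
                return defineClass(name, b, 0, b.length);
            }
        };
        try {
            return type.cast(loader.loadClass(className).getDeclaredConstructor().newInstance());
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Since nothing is written to disk, there is nothing to clean up and no shared temp directory to worry about. Third-party code that insists on java.io.File would still need one of the filesystem-level options listed above.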

Handling ZIP content in Java that uses the SHRINK algorithm

Anyone know of a way to handle ZIP files produced using the SHRINK algorithm? This doesn't appear to be supported by the standard Java ZIP functionality.
We're receiving ZIP files from an upstream system that (amazingly) have SHRINK-based compression in use. This seems to be from an older mainframe-based ZIP encoder that can't be easily modified to use something more modern.
In the interests of accepting an answer, it sounds like it's not possible to do directly in Java without porting code or building a JNI layer to hit native libraries that can handle this.

Existing solution for file deltas/versioning in Java

When versioning or optimizing file backups one idea is to use only the delta or data that has been modified.
This sounds like a simple idea at first, but actually determining where unmodified data ends and new data starts comes across as a difficult task.
Is there an existing framework that already does something like this or an efficient file comparison algorithm?
XDelta is not Java but is worth looking at anyway. There is a Java version of it, but I don't know how stable it is.
Instead of rolling your own, you might consider leveraging an open source version control system (eg, Subversion). You get a lot more than just a delta versioning algorithm that way.
It sounds like you are describing a difference based storage scheme. Most source code control systems use such systems to minimize their storage requirements. The *nix "diff" command is capable of generating the data you would need to implement it on your own.
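To make the idea of difference-based storage concrete, here is a deliberately naive sketch: it splits both versions into fixed-size blocks and reports which block indices differ, so only those blocks would need to be stored. This is an illustration only, not what diff, XDelta or Bsdiff do. The class name BlockDelta and the method changedBlocks are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: naive fixed-block comparison of two versions of a file.
final class BlockDelta {

    // Returns the indices of the blocks whose content differs between the two versions.
    static List<Integer> changedBlocks(byte[] oldData, byte[] newData, int blockSize) {
        if (blockSize <= 0) {
            throw new IllegalArgumentException("blockSize must be positive");
        }
        List<Integer> changed = new ArrayList<>();
        int longest = Math.max(oldData.length, newData.length);
        int blocks = (longest + blockSize - 1) / blockSize; // round up
        for (int i = 0; i < blocks; i++) {
            int from = i * blockSize;
            if (!Arrays.equals(slice(oldData, from, blockSize), slice(newData, from, blockSize))) {
                changed.add(i);
            }
        }
        return changed;
    }

    // Returns up to 'len' bytes starting at 'from', or an empty array past the end.
    private static byte[] slice(byte[] data, int from, int len) {
        if (from >= data.length) {
            return new byte[0];
        }
        return Arrays.copyOfRange(data, from, Math.min(data.length, from + len));
    }
}
```

The weakness shows up as soon as a byte is inserted or deleted: every later block shifts and is reported as changed. That is the "where does unmodified data end" problem from the question, and it is why dedicated delta tools do not rely on fixed block positions.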
Here's a Java library that can compute diffs between two plain text files:
http://code.google.com/p/google-diff-match-patch/
I don't know any library for binary diffs though. Try googling for 'java binary diff' ;-)
In my opinion, the Bsdiff tool is the best choice for binary files. It uses suffix sorting (Larsson and Sadakane's qsufsort) and takes advantage of how executable files change. Bsdiff was written in C by Colin Percival. Diff files created by Bsdiff are generally smaller than the files created by Xdelta.
It is also worth noting that Bsdiff uses the bzip2 compression algorithm. Binary patches created by Bsdiff can sometimes be compressed further using other compression algorithms (such as the one used by the WinRAR archiver).
Here is the site where you can find Bsdiff documentation and download Bsdiff for free: http://www.daemonology.net/bsdiff/

Is a Java JAR file similar to a .NET assembly?

I'm familiar with .Net and what assemblies are - but sadly my Java knowledge isn't as strong.
I know Java and .Net are different "worlds" (possibly like comparing apples with pears) but are JARs and .Net Assemblies roughly equivalent concepts?
Edit: Update based on initial responses
The way I interpret this is that yes they have similarities:
Both can contain resources and metadata.
But there's some core differences:
A .Net assembly is compiled; a JAR isn't.
JAR files aren't required to make a Java application; .Net assemblies are required for a .Net application.
[This isn't time for a religious war - I'd like to know if / how much of my understanding of .Net Assemblies I can apply to getting my head around Java (and maybe this will even help Java folks going the other way).]
There's a bunch of technical differences, but they are of little consequence most of the time, so basically, YES, they are the same concept.
I would say no, they are not the same concept, noting that a JAR can be used like an assembly. If you really want to get your head around a JAR file, just think of it as a ZIP file. That's all it really is.
Most often, that archive contains compiled class files. And most often, those class files are arranged in a hierarchical fashion corresponding to the class's package.
But JAR files frequently contain other stuff, such as message bundles, images, and even the source files. I'd encourage you to crack one open with the unzip client of your choice and take a look inside.
The JAR format is however the most common way of packaging a distributable for a Java library or application so in that way they are very similar.
From a language standpoint, JAR files are in no way required to make a Java application or library, nor would I say they are intrinsic to Java. However, both the standard library and the JDK have support for dealing with JAR files.
At one level, they are conceptually similar (chunk of byte code in a package). However, their implementation is very different.
A JAR file is actually a ZIP file containing some metadata, and all of the Java .class files (yes you can actually open it as a ZIP file and see inside), and whatever else was packaged up in it (resources, etc).
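The "JAR is a ZIP" point can be checked with the JDK alone. This minimal sketch writes a tiny JAR with JarOutputStream and then reads it back with the generic ZipInputStream, which knows nothing about JARs. The class name JarIsZip and the entry messages.properties are made up for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.jar.Attributes;
import java.util.jar.JarEntry;
import java.util.jar.JarOutputStream;
import java.util.jar.Manifest;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical demo class: builds a JAR in memory and reads it as a plain ZIP.
final class JarIsZip {

    // Builds a tiny JAR: a manifest plus one made-up resource entry.
    static byte[] buildJar() {
        Manifest manifest = new Manifest();
        // The manifest is only written out if it carries a version attribute.
        manifest.getMainAttributes().put(Attributes.Name.MANIFEST_VERSION, "1.0");
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (JarOutputStream jar = new JarOutputStream(buffer, manifest)) {
            jar.putNextEntry(new JarEntry("messages.properties"));
            jar.write("greeting=hello\n".getBytes(StandardCharsets.UTF_8));
            jar.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Lists entry names using only the generic ZIP reader.
    static List<String> listEntries(byte[] archive) {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(archive))) {
            for (ZipEntry entry = zip.getNextEntry(); entry != null; entry = zip.getNextEntry()) {
                names.add(entry.getName());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return names;
    }
}
```

The manifest appears as an ordinary entry, META-INF/MANIFEST.MF, which is the "some metadata" part of the description above.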
A .NET assembly is actually a valid Win32 PE file, and can be treated as such to some extent. A .NET .exe file actually begins executing as native code which loads the framework, which then loads the bytecode. "Executing" a JAR file requires launching the Java runtime through a .bat file, or file association, which then loads the JAR file separately.
