I took Adrien Grand's Java repository, which provides JNI bindings to the original native LZ4 code.
I want to compress multiple files under a given input directory, but LZ4 doesn't support multi-file compression the way the Java zip package does, so I tried another approach: tar all my input files and pipe the result as input to the LZ4 compressor. I used the JTar Java package for tarring all the input files. Is there a better way than this?
I came across many code samples for compressing strings and for correctly implementing the LZ4 compressor and decompressor. Now I want to know how to actually implement it for multiple files, and whether I'm going in the right direction.
After tarring all the files, the sample code usage says I have to convert my tarred file to a byte array to provide it to the compressor module. I used the Apache Commons IO (IOUtils) package for this purpose. Considering that I have many input files, which results in a tar of huge size, always converting it to a byte array seems inefficient to me. I first want to know whether this is effective or not, or whether there is a better way of using the LZ4 package.
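Roughly the flow I'm describing, as a simplified sketch (not my exact code; file names and the lz4-java calls shown here are just for illustration):

import net.jpountz.lz4.LZ4Compressor;
import net.jpountz.lz4.LZ4Factory;
import org.apache.commons.io.IOUtils;

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;

public class CurrentFlow {
    public static void main(String[] args) throws IOException {
        // 1. The input directory has already been tarred to backup.tar (e.g. with JTar).
        byte[] tarBytes;
        try (FileInputStream in = new FileInputStream("backup.tar")) {
            tarBytes = IOUtils.toByteArray(in);   // the whole tar ends up in memory here
        }

        // 2. Compress the byte array with lz4-java's block API.
        LZ4Compressor compressor = LZ4Factory.fastestInstance().fastCompressor();
        byte[] compressed = compressor.compress(tarBytes);

        // 3. Write the compressed bytes out.
        try (FileOutputStream out = new FileOutputStream("MyResult.lz4")) {
            out.write(compressed);
        }
    }
}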
Another problem I came across was the end result. After compressing the tarred files I get something like MyResult.lz4 as output, but I was not able to decompress it with the archive manager (I'm using Ubuntu), because it doesn't support this format. I'm also not clear about which archive and compression formats I should use here, and what format the end result should be in.

Speaking from a user's point of view: consider a case where I'm generating a backup for the user. If I provide a traditional .zip, .gz or other well-known format, the user can decompress it himself. Just because I know LZ4 doesn't mean I should expect the user to know the format too; he may even be baffled at seeing it. By that logic a conversion from .lz4 to .zip also seems pointless.

I already see the tarring of all my input files as a time-consuming step, so I want to know how much it affects performance. With the Java zip package, compressing multiple input files didn't seem to be a problem at all. Besides LZ4 I also came across Apache Commons Compress and TrueZIP, and several Stack Overflow links about them helped me learn a lot. As of now I really want to use LZ4 for compression, especially because of its performance, but I ran into these hurdles. Can anyone with good knowledge of the LZ4 package address these questions and problems, along with a simple implementation? Thanks.
Timings I measured for an input consisting of many files:
Time taken for tarring: 4704 ms
Time taken for converting the file to a byte array: 7 ms
Time taken for compression: 33 ms
Some facts:
LZ4 is no different here than GZIP: it is a single-concern project, dealing with compression. It does not deal with archive structure. This is intentional.
Adrien Grand's LZ4 lib produces output incompatible with the command-line LZ4 utility. This is also intentional.
Your approach with tar seems OK, because that's how it's done with GZIP.
Ideally, you should make the tar code produce a stream that is immediately compressed, instead of first being stored entirely in RAM. This is what Unix pipes achieve at the command line.
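A rough sketch of that idea, assuming JTar (which the question already uses) and lz4-java's LZ4BlockOutputStream; note the block format is not what the lz4 command-line tool reads, as mentioned above, and the paths are placeholders:

import net.jpountz.lz4.LZ4BlockOutputStream;
import org.kamranzafar.jtar.TarEntry;
import org.kamranzafar.jtar.TarOutputStream;

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.file.Files;

public class StreamingTarLz4 {
    public static void main(String[] args) throws IOException {
        File inputDir = new File("inputDir");   // hypothetical input directory
        // Wrap the compressor around the file stream, then hand it to the tar writer,
        // so tar bytes are compressed as they are produced instead of buffered in RAM.
        try (TarOutputStream tar = new TarOutputStream(
                new LZ4BlockOutputStream(new FileOutputStream("backup.tar.lz4")))) {
            for (File f : inputDir.listFiles()) {
                if (!f.isFile()) {
                    continue;                    // this sketch skips subdirectories
                }
                tar.putNextEntry(new TarEntry(f, f.getName()));
                Files.copy(f.toPath(), tar);     // no intermediate byte[]
            }
        }
    }
}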
I had the same problem. The current release of LZ4 for Java is incompatible with the later-developed LZ4 standard for handling streams; however, in the project's repo there is a patch that supports the standard for compressing/decompressing streams, and I can confirm it is compatible with the command-line tool. You can find it here: https://github.com/jpountz/lz4-java/pull/61
In Java you can use that together with TarArchiveInputStream from Apache Commons Compress.
If you want an example, the code I use is in the Maven artifact io.github.htools 0.27-SNAPSHOT (or on GitHub): the classes io.github.htools.io.compressed.TarLz4FileWriter and (the obsolete class) io.github.htools.io.compressed.TarLz4File show how it works. In HTools, tar and LZ4 are automatically used through ArchiveFile.getReader(String filename) and ArchiveFileWriter(String filename, int compressionlevel), provided your filename ends with .tar.lz4.
You can chain I/O streams together, so using something like the tar support from Apache Commons Compress and LZ4 from lz4-java:
try (LZ4FrameOutputStream outputStream = new LZ4FrameOutputStream(new FileOutputStream("path/to/myfile.tar.lz4"));
     TarArchiveOutputStream taos = new TarArchiveOutputStream(outputStream)) {
    ...
}
Consolidating the bytes into a byte array will cause a bottleneck, because you are then trying to hold the entire stream in memory, which can easily run into OutOfMemory problems with large streams. Instead, you'll want to pipeline the bytes through all the I/O streams as above.
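For illustration, a minimal sketch of what could go inside that try block, assuming a flat input directory (the directory path and entry names are made up for the example):

import net.jpountz.lz4.LZ4FrameOutputStream;
import org.apache.commons.compress.archivers.tar.TarArchiveEntry;
import org.apache.commons.compress.archivers.tar.TarArchiveOutputStream;

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.file.Files;

public class TarLz4Writer {
    public static void main(String[] args) throws IOException {
        File inputDir = new File("path/to/inputDir");   // hypothetical input directory
        try (LZ4FrameOutputStream lz4Out =
                 new LZ4FrameOutputStream(new FileOutputStream("path/to/myfile.tar.lz4"));
             TarArchiveOutputStream taos = new TarArchiveOutputStream(lz4Out)) {
            for (File file : inputDir.listFiles()) {
                if (!file.isFile()) {
                    continue;                            // this sketch skips subdirectories
                }
                TarArchiveEntry entry = new TarArchiveEntry(file, file.getName());
                taos.putArchiveEntry(entry);
                Files.copy(file.toPath(), taos);         // stream file bytes straight into the tar
                taos.closeArchiveEntry();
            }
            taos.finish();                               // write the tar trailer before closing
        }
    }
}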
I created a Java library that does this for you: https://github.com/spoorn/tar-lz4-java
If you want to implement it yourself, here's a technical doc that includes details on how to LZ4 compress a directory using TarArchive from Apache Commons and lz4-java: https://github.com/spoorn/tar-lz4-java/blob/main/SUMMARY.md#lz4
Related
I need to write a quick program (using Java, since it's the only language I am really comfortable with) that takes an Excel file (or CSV) and parses through the data, adding information that might be missing.
The problem I'm having is that I can't decide how to start: it feels like manipulating an Excel file would be easier, but reading through a CSV file would be really simple.
Any insight on problems that might come up, or a third solution that I'm overlooking, would be appreciated.
The Excel document is basically just a mini audited database of printer IPs, names, manufacturers, and locations.
Edit: The general consensus seems to be that CSV is a lot easier to manipulate, and since I want a quick script that can just be run, I think downloading an extra library for Excel manipulation would be a hassle.
Going to start writing the code today or Monday; I will probably have more questions later in the week. Thank you everyone for your help! Venturing into new territory with my first job.
If reading a CSV is an option in your situation, I would definitely go for it, because you can do it in a way that is both system-independent and portable without using external libraries.
As far as efficiency goes, the timing is very likely going to be I/O dominated, so the smaller the file, the faster you are going to read it in.
Adding the missing information and writing the file back may be a bit tricky because of the need to properly handle quotes, but it is still a lot simpler than accessing an Excel file through a special-purpose library.
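A minimal sketch of the kind of quote handling meant here, with no external libraries (the helper names are made up for the example):

import java.util.ArrayList;
import java.util.List;

public class SimpleCsv {

    // Parse one CSV line, honouring double quotes around fields that contain commas.
    static List<String> parseLine(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == '"') {
                if (inQuotes && i + 1 < line.length() && line.charAt(i + 1) == '"') {
                    current.append('"');          // "" inside a quoted field is an escaped quote
                    i++;
                } else {
                    inQuotes = !inQuotes;
                }
            } else if (c == ',' && !inQuotes) {
                fields.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString());
        return fields;
    }

    // Quote a field when writing back out, if it contains commas or quotes.
    static String escape(String field) {
        if (field.contains(",") || field.contains("\"")) {
            return "\"" + field.replace("\"", "\"\"") + "\"";
        }
        return field;
    }
}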
CSV will be easier since you do not need any additional libraries like jxl. Refer to this read and write CSV tutorial.
500x10 is really quite small, so it is difficult to imagine a lot of code being required. If sticking with Excel, I would expect its built-in features (Find/Replace, Sort, Filter, PivotTable, Copy Down, etc.) to be sufficient.
I am looking for a java library that extracts files from archives/ compresses files to create archives.
Given below are my requirements:
(1) It should extract the widest range/types of archives.
(2) It should be simple, requiring only a few lines of code (unlike Java's default zip classes).
(3) It should be production-grade: without defects.
Apache Commons Compress seems to be a good fit: it can deal with quite a lot of different compression types and is very straightforward to use.
http://commons.apache.org/compress/index.html
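For instance, a minimal extraction sketch with Commons Compress, using its auto-detecting ArchiveStreamFactory (the file and directory names are placeholders):

import org.apache.commons.compress.archivers.ArchiveEntry;
import org.apache.commons.compress.archivers.ArchiveInputStream;
import org.apache.commons.compress.archivers.ArchiveStreamFactory;
import org.apache.commons.compress.utils.IOUtils;

import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;

public class ExtractExample {
    public static void main(String[] args) throws Exception {
        File outDir = new File("out");    // hypothetical output directory
        try (InputStream in = new BufferedInputStream(new FileInputStream("archive.tar"));
             ArchiveInputStream ais = new ArchiveStreamFactory().createArchiveInputStream(in)) {
            ArchiveEntry entry;
            while ((entry = ais.getNextEntry()) != null) {
                File target = new File(outDir, entry.getName());
                if (entry.isDirectory()) {
                    target.mkdirs();
                } else {
                    target.getParentFile().mkdirs();
                    try (OutputStream out = new FileOutputStream(target)) {
                        IOUtils.copy(ais, out);   // copy this entry's bytes to disk
                    }
                }
            }
        }
    }
}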
Anyone know of a way to handle ZIP files produced using the SHRINK algorithm? This doesn't appear to be supported by the standard Java ZIP functionality.
We're receiving ZIP files from an upstream system that (amazingly) have SHRINK-based compression in use. This seems to be from an older mainframe-based ZIP encoder that can't be easily modified to use something more modern.
In the interests of accepting an answer, it sounds like it's not possible to do directly in Java without porting code or building a JNI layer to hit native libraries that can handle this.
When versioning or optimizing file backups, one idea is to store only the delta, i.e. the data that has been modified.
This sounds like a simple idea at first, but actually determining where unmodified data ends and new data starts comes across as a difficult task.
Is there an existing framework that already does something like this, or an efficient file comparison algorithm?
XDelta is not Java but is worth looking at anyway. There is a Java version of it, but I don't know how stable it is.
Instead of rolling your own, you might consider leveraging an open-source version control system (e.g., Subversion). You get a lot more than just a delta versioning algorithm that way.
It sounds like you are describing a difference based storage scheme. Most source code control systems use such systems to minimize their storage requirements. The *nix "diff" command is capable of generating the data you would need to implement it on your own.
Here's a Java library that can compute diffs between two plain text files:
http://code.google.com/p/google-diff-match-patch/
I don't know any library for binary diffs though. Try googling for 'java binary diff' ;-)
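For text files, a rough sketch of how that library is typically used to produce and apply a delta (the package name below is the one used by the Java port; the sample strings are made up):

import name.fraser.neil.plaintext.diff_match_patch;

import java.util.LinkedList;

public class TextDeltaExample {
    public static void main(String[] args) {
        String oldText = "printer1,10.0.0.1,HP,Room 101";
        String newText = "printer1,10.0.0.2,HP,Room 101";

        diff_match_patch dmp = new diff_match_patch();

        // Compute a patch (the "delta") from the old version to the new one.
        LinkedList<diff_match_patch.Patch> patches = dmp.patch_make(oldText, newText);
        String delta = dmp.patch_toText(patches);   // this is what you would store
        System.out.println(delta);

        // Rebuild the new version from the old one plus the patches.
        Object[] result = dmp.patch_apply(patches, oldText);
        System.out.println(result[0]);              // the reconstructed new text
    }
}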
In my opinion, the Bsdiff tool is the best choice for binary files. It uses suffix sorting (Larsson and Sadakane's qsufsort) and takes advantage of how executable files change. Bsdiff was written in C++ by Colin Percival. Diff files created by Bsdiff are generally smaller than the files created by Xdelta.
It is also worth noting that Bsdiff uses the bzip2 compression algorithm. Binary patches created by Bsdiff can sometimes be compressed further with other compression algorithms (like the one used by the WinRAR archiver).
Here is the site where you can find Bsdiff documentation and download Bsdiff for free: http://www.daemonology.net/bsdiff/
I am doing a project related to a configuration and memory analyzer for Kubuntu.
I want to display system statistics such as CPU usage, RAM usage, processes, etc. graphically, using an odometer-style gauge.
I wanted to know if there is a good open-source library for graphical components like odometers and other graphing utilities.
Another problem is that I have to get the CPU information from somewhere, parse it, and feed it into the odometer for display.
One method would be to use command-line utilities, parse their results, and feed them to the graphical component.
Another option is a library called libstatgrab, which is written entirely in C, so I would need to use JNI.
I don't like either of these approaches because I am a little short on time and need a library that can do these things for me. There is a binding library for Python to libstatgrab, but not for Java.
If anyone has another approach, please write it up.
For collecting the statistics, I would read directly from /proc or /sys, since they're just text files which are readily parseable (slightly more so than exec()-ing a command-line tool and reading its output). Look at /proc/meminfo, /proc/loadavg, /proc/stat and others.
You can look at the C source of the procps package to see how these files are worked with by running
apt-get source procps
In there, you can look at how top.c reads the /proc/stat file.
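A minimal sketch of reading a couple of those files from plain Java (field layouts as documented in proc(5)):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

public class ProcReader {
    public static void main(String[] args) throws IOException {
        // /proc/loadavg looks like "0.52 0.58 0.59 1/189 12345";
        // the first field is the 1-minute load average.
        String loadavg = Files.readAllLines(Paths.get("/proc/loadavg")).get(0);
        double load1min = Double.parseDouble(loadavg.split("\\s+")[0]);
        System.out.println("1-minute load average: " + load1min);

        // /proc/meminfo has lines like "MemTotal:       16384256 kB".
        List<String> meminfo = Files.readAllLines(Paths.get("/proc/meminfo"));
        for (String line : meminfo) {
            if (line.startsWith("MemTotal:") || line.startsWith("MemFree:")) {
                System.out.println(line.trim());
            }
        }
    }
}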
As for charting, the "bog standard" plotting library is JFreeChart.
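For the odometer-style display specifically, JFreeChart's MeterPlot is a reasonable fit; a rough sketch (class names from the JFreeChart API, the value shown is made up):

import org.jfree.chart.ChartPanel;
import org.jfree.chart.JFreeChart;
import org.jfree.chart.plot.MeterPlot;
import org.jfree.data.Range;
import org.jfree.data.general.DefaultValueDataset;

import javax.swing.JFrame;

public class CpuMeter {
    public static void main(String[] args) {
        DefaultValueDataset dataset = new DefaultValueDataset(42.0);  // e.g. CPU usage in percent
        MeterPlot plot = new MeterPlot(dataset);
        plot.setRange(new Range(0, 100));
        plot.setUnits("%");
        JFreeChart chart = new JFreeChart("CPU usage", plot);

        JFrame frame = new JFrame("System monitor");
        frame.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
        frame.setContentPane(new ChartPanel(chart));
        frame.pack();
        frame.setVisible(true);

        // Update the dial later with dataset.setValue(newCpuPercent);
    }
}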
"there is a binding library present for Python to libstatgrab but not to java"
Use Jython?