When versioning or optimizing file backups, one idea is to store only the delta, i.e. the data that has been modified.
This sounds like a simple idea at first, but actually determining where unmodified data ends and new data begins turns out to be a difficult task.
Is there an existing framework that already does something like this, or an efficient file comparison algorithm?
XDelta is not Java but is worth looking at anyway. There is a Java version of it, but I don't know how stable it is.
Instead of rolling your own, you might consider leveraging an open source version control system (e.g., Subversion). You get a lot more than just a delta versioning algorithm that way.
It sounds like you are describing a difference-based storage scheme. Most source code control systems use such a scheme to minimize their storage requirements. The *nix "diff" command can generate the data you would need to implement it on your own.
Here's a Java library that can compute diffs between two plain text files:
http://code.google.com/p/google-diff-match-patch/
I don't know any library for binary diffs though. Try googling for 'java binary diff' ;-)
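For example, a minimal sketch of computing a text delta with that library (oldText and newText are placeholder strings; the exact import/package depends on how you obtain the library):

    // Compute a character-level diff and a compact textual patch with google-diff-match-patch.
    diff_match_patch dmp = new diff_match_patch();
    LinkedList<diff_match_patch.Diff> diffs = dmp.diff_main(oldText, newText);
    dmp.diff_cleanupSemantic(diffs);                        // merge tiny edits into readable chunks
    LinkedList<diff_match_patch.Patch> patches = dmp.patch_make(oldText, newText);
    String patchText = dmp.patch_toText(patches);           // the delta you could store per version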
In my opinion, the Bsdiff tool is the best choice for binary files. It uses suffix sorting (Larsson and Sadakane's qsufsort) and takes advantage of how executable files change. Bsdiff was written in C by Colin Percival. Diff files created by Bsdiff are generally smaller than the files created by Xdelta.
It is also worth noting that Bsdiff uses the bzip2 compression algorithm. Binary patches created by Bsdiff can sometimes be compressed further with other compression algorithms (such as the one used by the WinRAR archiver).
Here is the site where you can find Bsdiff documentation and download Bsdiff for free: http://www.daemonology.net/bsdiff/
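Since Bsdiff is a command-line tool, from Java you would typically just shell out to it. A minimal sketch, assuming the bsdiff and bspatch binaries are installed and on the PATH (file names are placeholders):

    // Create a patch: bsdiff <oldfile> <newfile> <patchfile>
    Process diff = new ProcessBuilder("bsdiff", "old.bin", "new.bin", "patch.bsdiff")
            .inheritIO().start();
    if (diff.waitFor() != 0) throw new IOException("bsdiff failed");

    // Rebuild the new file from the old file plus the patch: bspatch <oldfile> <newfile> <patchfile>
    Process patch = new ProcessBuilder("bspatch", "old.bin", "rebuilt.bin", "patch.bsdiff")
            .inheritIO().start();
    if (patch.waitFor() != 0) throw new IOException("bspatch failed");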
Using Spark-based big data analysis, I have to compare data sets (text files) that are very similar (>98%) but very large. After doing some research, I found that the most efficient way could be to use delta encoders. With those I can keep a reference text and store the others as delta increments. However, I use Scala, which does not have support for delta encoders, and I am not at all conversant with Java. But since Scala is interoperable with Java, I know that it is possible to get a Java library working in Scala.
The promising implementations I found are xdelta, vcdiff-java and bsdiff. With a bit more searching, I found the most interesting library, dez. The link also gives benchmarks in which it seems to perform very well, the code is free to use, and it looks lightweight.
At this point, I am stuck with using this library in Scala (via sbt). I would appreciate any suggestions or references for getting past this barrier, either specific to this issue (delta encoders) or this library, or about working with a Java API in general from Scala. Specifically, my questions are:
Is there a Scala library for delta encoders that I can directly use? (If not)
Is it possible that I place the class files/notzed.dez.jar in the project and let sbt provide the APIs in the Scala code?
I am kind of stuck in this quagmire and any way out would be greatly appreciated.
There are several details to take into account. There is no problem in using Java libraries directly from Scala, either as managed dependencies in sbt or as unmanaged dependencies (https://www.scala-sbt.org/1.x/docs/Library-Dependencies.html): "Dependencies in lib go on all the classpaths (for compile, test, run, and console)". You can create a fat jar with your code and dependencies with https://github.com/sbt/sbt-native-packager and distribute it with spark-submit.
The point here is to use these frameworks in Spark. To take advantage of Spark, you would need to split your files into blocks to distribute the algorithm across the cluster for a single file. Or, if your files are compressed and you have each of them in one HDFS partition, you would need to adjust the size of the HDFS blocks, etc.
You can also use the C modules: include them in your project and call them via JNI, the same way deep learning frameworks call native linear algebra functions. So, in essence, there is a lot to discuss about how to implement these delta algorithms in Spark.
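On the sbt side specifically, nothing special is needed for an unmanaged jar such as notzed.dez.jar: drop it into the project's lib/ directory and sbt puts it on every classpath. A minimal sketch of the only setting you might touch, assuming you keep unmanaged jars somewhere other than the default lib/ (custom_lib is a placeholder path):

    // build.sbt -- only needed if your unmanaged jars are not in the default lib/ directory
    unmanagedBase := baseDirectory.value / "custom_lib"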
I am trying to implement a data deduplication program in the cloud using Java.
I'm not sure how to proceed with the implementation.
First, I wanted to do a simple comparison of the file size, date and name of the file. However, this is ineffective since files might have the same content but different names.
I have decided on a simple algorithm, which is:
file upload -> file chunking -> Rabin-Karp hashing -> determine whether the file needs to be uploaded.
Will this be fine or are there any improvements?
Where would I be able to find out more information on this? I have tried looking around the Internet but I can't find much. Most of what I find is broken down into specific implementations, without explanation or details on file chunking or Rabin-Karp hashing.
I would also like to know which Java libraries I should look into for this program.
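From what I have gathered so far, content-defined chunking with a Rabin-Karp style rolling hash could be sketched roughly like the following (all constants are illustrative and untuned), but I am not sure whether this is the right approach:

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    public class RabinChunker {
        // Illustrative parameters: 48-byte rolling window, 2 KiB minimum chunk,
        // boundaries roughly every 8 KiB on average.
        static final int WINDOW = 48, MIN_CHUNK = 2048;
        static final long PRIME = 31, MOD = 1_000_000_007L, MASK = (1 << 13) - 1;

        public static List<byte[]> chunk(byte[] data) {
            List<byte[]> chunks = new ArrayList<>();
            long pow = 1;                                    // PRIME^(WINDOW-1) mod MOD
            for (int i = 1; i < WINDOW; i++) pow = (pow * PRIME) % MOD;

            long hash = 0;
            int start = 0;
            for (int i = 0; i < data.length; i++) {
                if (i - start >= WINDOW) {                   // slide: drop the byte leaving the window
                    int out = data[i - WINDOW] & 0xFF;
                    hash = (hash - (out * pow) % MOD + MOD) % MOD;
                }
                hash = (hash * PRIME + (data[i] & 0xFF)) % MOD;
                // Cut a chunk when the low bits of the rolling hash match a fixed pattern.
                if (i - start + 1 >= MIN_CHUNK && (hash & MASK) == MASK) {
                    chunks.add(Arrays.copyOfRange(data, start, i + 1));
                    start = i + 1;
                    hash = 0;                                // restart the window for the next chunk
                }
            }
            if (start < data.length) chunks.add(Arrays.copyOfRange(data, start, data.length));
            return chunks;
        }
    }

Each chunk would then be identified by a hash of its contents (e.g. SHA-256), and only chunks whose hashes are not already stored would need to be uploaded.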
It would be easier if you state your problem constraints. Assuming the following:
The smallest indivisible unit of data is a file
Files are small enough to fit in memory for computing hashes
Your files are in some cloud bucket or similar where you can list them all. That also eliminates identical filenames.
You can probably narrow down your problem.
Iterate through all the files, hash each one with some fast hashing algorithm like a basic CRC checksum, and build a map from hash to files. (This can be easily parallelized.)
Filter down to the files whose hashes collide. You can safely drop the rest from consideration, which for all practical purposes should be a pretty large chunk of the data.
Run through this remaining subset of files with a cryptographic hash (or, in the worst case, compare the entire files) and identify matches.
This can be refined depending on the underlying data.
However, this is how I would approach the problem, and given its structure it can easily be partitioned and solved in parallel. Feel free to elaborate more so that we can reach a good solution.
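A minimal sketch of that two-pass idea, assuming the files sit on a local file system and are small enough to read into memory (the root directory "data" is a placeholder):

    import java.nio.file.*;
    import java.security.MessageDigest;
    import java.util.*;
    import java.util.stream.*;
    import java.util.zip.CRC32;

    public class DuplicateFinder {
        public static void main(String[] args) throws Exception {
            List<Path> files;
            try (Stream<Path> walk = Files.walk(Paths.get("data"))) {       // placeholder root
                files = walk.filter(Files::isRegularFile).collect(Collectors.toList());
            }

            // Pass 1: cheap CRC32 checksum -> candidate groups.
            Map<Long, List<Path>> byCrc = new HashMap<>();
            for (Path p : files) {
                CRC32 crc = new CRC32();
                crc.update(Files.readAllBytes(p));                          // assumes files fit in memory
                byCrc.computeIfAbsent(crc.getValue(), k -> new ArrayList<>()).add(p);
            }

            // Pass 2: confirm candidate groups with a cryptographic hash.
            for (List<Path> group : byCrc.values()) {
                if (group.size() < 2) continue;                             // unique checksum, skip
                Map<String, List<Path>> bySha = new HashMap<>();
                for (Path p : group) {
                    byte[] digest = MessageDigest.getInstance("SHA-256").digest(Files.readAllBytes(p));
                    bySha.computeIfAbsent(Base64.getEncoder().encodeToString(digest),
                                          k -> new ArrayList<>()).add(p);
                }
                bySha.values().stream().filter(g -> g.size() > 1)
                     .forEach(dups -> System.out.println("Duplicates: " + dups));
            }
        }
    }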
I took Adrien Grand's Java repository, which provides JNI bindings to the original LZ4 native code.
I want to compress multiple files under a given input directory, but LZ4 doesn't support multi-file compression like the Java zip package does, so I tried another approach: tar all my input files and feed the result to the LZ4 compressor. I used the JTar Java package for tarring all my input files. Is there any better way than this?
I came across many code samples that compress strings and show how to correctly use the LZ4 compressor and decompressor. Now I want to know how to actually apply it to multiple files, and I also want to check whether I'm going in the right direction.
After tarring all the files, according to the sample code's usage explanation, I have to convert the tar file to a byte array to provide it to the compressor module. I used the Apache Commons IOUtils package for this purpose. Considering that I have many input files, which results in a huge tar, always converting it to a byte array seems inefficient to me. I first wanted to know whether this is efficient or not, and whether there is a better way of using the LZ4 package.
Another problem I came across was the end result. After compressing the tarred files I get an output like MyResult.lz4, but I was not able to decompress it using the archive manager (I'm using Ubuntu), as it doesn't support this format. I'm also not clear about which archive and compression format I should use here, or what format the end result should be in. Speaking from a user's point of view: if I'm generating a backup for users and I give them a traditional .zip, .gz or other well-known format, they can decompress it themselves. Just because I know LZ4 doesn't mean I should expect users to know that format too, right? They may even be baffled on seeing it. So a conversion from .lz4 to .zip also seems meaningless.
I already see the tarring of all my input files as a time-consuming step, so I wanted to know how much it affects performance. With the Java zip package, compressing multiple input files didn't seem to be a problem at all. Besides LZ4, I came across Apache Commons Compress and TrueZIP, and several Stack Overflow links about them which helped me learn a lot. As of now I really want to use LZ4 for compression, especially for its performance, but I ran into these hurdles. Can anyone with good knowledge of the LZ4 package provide solutions to these questions and problems, along with a simple implementation? Thanks.
Times I measured for an input consisting of many files:
Time taken for tarring: 4704 ms
Time taken for converting the tar file to a byte array: 7 ms
Time taken for compression: 33 ms
Some facts:
LZ4 is no different here than GZIP: it is a single-concern project, dealing with compression. It does not deal with archive structure. This is intentional.
Adrien Grand's LZ4 lib produces output incompatible with the command-line LZ4 utility. This is also intentional.
Your approach with tar seems OK, because that's how it's done with GZIP.
Ideally, you should make the tar code produce a stream that is compressed immediately, instead of being stored entirely in RAM first. This is what Unix pipes achieve at the command line.
I had the same problem. The current release of LZ4 for Java is incompatible with the LZ4 stream format that was standardized later. However, in the project's repo there is a patch that supports compressing/decompressing streams in that format, and I can confirm it is compatible with the command-line tool. You can find it here: https://github.com/jpountz/lz4-java/pull/61
In Java you can use that together with TarArchiveInputStream from Apache Commons Compress.
If you want an example, the code I use is in the Maven artifact io.github.htools 0.27-SNAPSHOT (or on GitHub). The classes io.github.htools.io.compressed.TarLz4FileWriter and (the obsolete class) io.github.htools.io.compressed.TarLz4File show how it works. In HTools, tar and LZ4 are used automatically through ArchiveFile.getReader(String filename) and ArchiveFileWriter(String filename, int compressionlevel), provided your filename ends with .tar.lz4.
You can chain I/O streams together, so using something like TarArchiveOutputStream from Apache Commons Compress and LZ4FrameOutputStream from lz4-java:
    try (LZ4FrameOutputStream outputStream = new LZ4FrameOutputStream(new FileOutputStream("path/to/myfile.tar.lz4"));
         TarArchiveOutputStream taos = new TarArchiveOutputStream(outputStream)) {
        ...
    }
Consolidating the bytes into a byte array will cause a bottleneck, because you are then holding the entire stream in memory, which can easily run into OutOfMemory problems with large streams. Instead, you'll want to pipeline the bytes through the chained I/O streams as above.
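A slightly fuller sketch of that pipeline, walking an input directory and streaming each file straight through the tar + LZ4 chain (the directory, output path and long-file mode are assumptions, not taken from the question; the LZ4 frame format written here should be the one the lz4 command-line tool understands):

    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.nio.file.*;
    import java.util.stream.Stream;
    import net.jpountz.lz4.LZ4FrameOutputStream;
    import org.apache.commons.compress.archivers.tar.TarArchiveEntry;
    import org.apache.commons.compress.archivers.tar.TarArchiveOutputStream;

    public class TarLz4Sketch {
        public static void main(String[] args) throws IOException {
            Path inputDir = Paths.get("input");                              // placeholder directory
            try (LZ4FrameOutputStream lz4Out =
                     new LZ4FrameOutputStream(new FileOutputStream("backup.tar.lz4"));
                 TarArchiveOutputStream taos = new TarArchiveOutputStream(lz4Out);
                 Stream<Path> walk = Files.walk(inputDir)) {
                taos.setLongFileMode(TarArchiveOutputStream.LONGFILE_POSIX); // allow long entry names
                for (Path p : (Iterable<Path>) walk.filter(Files::isRegularFile)::iterator) {
                    taos.putArchiveEntry(new TarArchiveEntry(p.toFile(), inputDir.relativize(p).toString()));
                    Files.copy(p, taos);                                     // stream bytes, no byte[] buffering
                    taos.closeArchiveEntry();
                }
            }
        }
    }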
I created a Java library that does this for you: https://github.com/spoorn/tar-lz4-java
If you want to implement it yourself, here's a technical doc that includes details on how to LZ4 compress a directory using TarArchive from Apache Commons and lz4-java: https://github.com/spoorn/tar-lz4-java/blob/main/SUMMARY.md#lz4
Hi guys: I have a file system with lots of "parallel" data in it (details: it's a local Hadoop development environment). In any case,
I want a file browser tool that is pluggable, so that when I click on certain files, certain readers are invoked.
I also want to compare parallel directories. For example, if I have a/, b/ and c/, each of which has an output.txt, I want to compare the size/contents of output.txt across those directories.
Although I realize these are somewhat strange comparisons to make, I believe programmers probably do such comparisons quite often. Does any generic tool exist for browsing large file/directory repositories on disk?
Hopefully it would be Java, and Java-pluggable, but even a simple Mac OS X application might be useful to some extent.
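For the record, I can script the comparison part myself if needed, something like this minimal sketch that prints the size and a CRC32 checksum of the same file name across sibling directories (directory and file names are placeholders); what I'm really after is a browsable, pluggable tool:

    import java.nio.file.*;
    import java.util.zip.CRC32;

    public class ParallelCompare {
        public static void main(String[] args) throws Exception {
            String[] dirs = {"a", "b", "c"};                   // placeholder sibling directories
            for (String dir : dirs) {
                Path p = Paths.get(dir, "output.txt");
                CRC32 crc = new CRC32();
                crc.update(Files.readAllBytes(p));             // fine for reasonably small files
                System.out.printf("%s  size=%d  crc32=%08x%n", p, Files.size(p), crc.getValue());
            }
        }
    }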
Kdiff3 has a great directory compare tool.
For comparing files, the search term you are looking for is 'diff viewer'. Google immediately returns a link to this SO question. There the recommended viewer is FileMerge, although personally I prefer Kdiff3, which seems to support OS X as well.
Why does Java not have a file copy method? This seems like such an obvious thing to have, and it would save people from writing things like this example.
The Java API is missing more than just file copying. You might be interested in checking out the Apache Commons libraries. For example, the IO library's FileUtils provides file copying methods.
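For instance, a one-line sketch with Commons IO (the file paths are placeholders):

    import java.io.File;
    import org.apache.commons.io.FileUtils;

    // Copies the file, creating the destination directory if it does not exist.
    FileUtils.copyFile(new File("source.txt"), new File("backup/source.txt"));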
My guess is that when the file I/O system was written, they decided they did not want to deal with the cross-platform problems of copying files and punted, i.e. they said "this is doable by others, and is not that common".
One thing to keep in mind about Java is that it is cross-platform, so some things are more difficult because of that reality.
java.io.File is a relatively simple class introduced in 1.0. JDK 1.0 didn't have much in it, mostly support for applets and the javac compiler. I guess there hasn't been a great deal of pressure to expand it; applets and enterprise software are not oriented in that direction.
However, lots has been added to I/O for JDK 7, including java.nio.file.Path.copyTo: http://download.java.net/jdk7/docs/api/java/nio/file/Path.html#copyTo(java.nio.file.Path, java.nio.file.CopyOption...)
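For reference, this functionality shipped in the final JDK 7 release as java.nio.file.Files.copy rather than Path.copyTo; a minimal sketch (the paths are placeholders):

    import java.nio.file.*;

    // Copy a file, overwriting the target if it already exists.
    Files.copy(Paths.get("a.txt"), Paths.get("b.txt"), StandardCopyOption.REPLACE_EXISTING);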
For the same reason Java does not have many other things, which end up being implemented by external libraries.
I am sure you can easily find such a library, or you can write the function yourself.