Delta encoders: using a Java library in Scala

I have to compare, using Spark-based big data analysis, data sets (text files) that are very similar (>98%) but very large. After doing some research, I found that the most efficient approach could be delta encoding: I can keep one reference text and store the others as delta increments. However, I work in Scala, which has no delta-encoder support of its own, and I am not at all conversant with Java. Since Scala is interoperable with Java, though, I know it should be possible to get a Java library working in Scala.
The implementations I found promising are xdelta, vcdiff-java and bsdiff. With a bit more searching I came across the most interesting library, dez. Its page also gives benchmarks in which it seems to perform very well, and the code is free to use and looks lightweight.
At this point I am stuck on using this library in Scala (via sbt). I would appreciate any suggestions or references for getting past this barrier, whether specific to this problem (delta encoders), to the library, or to working with Java APIs from Scala in general. Specifically, my questions are:
Is there a Scala library for delta encoders that I can use directly? (If not:)
Is it possible to place the class files / notzed.dez.jar in the project and have sbt expose the API to the Scala code?
I am kind of stuck in this quagmire and any way out would be greatly appreciated.

There are several details to take into account. There is no problem in using Java libraries directly from Scala, either as managed dependencies in sbt or as unmanaged dependencies (https://www.scala-sbt.org/1.x/docs/Library-Dependencies.html): "Dependencies in lib go on all the classpaths (for compile, test, run, and console)". You can then create a fat jar with your code and its dependencies using https://github.com/sbt/sbt-native-packager and distribute it with spark-submit.
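For concreteness, a minimal build.sbt along those lines might look like the sketch below (the Spark coordinates are only an example; the unmanaged route needs no declaration at all):

// build.sbt -- minimal sketch; names and versions are examples, not requirements
name         := "delta-spark"
scalaVersion := "2.12.18"

// Managed dependency: declared here and fetched by sbt.
libraryDependencies += "org.apache.spark" %% "spark-core" % "3.5.1" % "provided"

// Unmanaged dependency: just drop notzed.dez.jar into <project-root>/lib/ and it lands
// on the compile, test, run and console classpaths automatically.

From there, sbt-native-packager (or sbt-assembly) can bundle your code and the jars in lib/ into a single artifact that spark-submit can ship to the executors.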
The real point here is using these frameworks in Spark. To take advantage of Spark you would need to split your files into blocks so that the algorithm for a single file can be distributed across the cluster, as in the sketch below. Alternatively, if your files are compressed and each of them sits in one HDFS partition, you would need to adjust the size of the HDFS blocks, and so on.
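A rough sketch of that block-splitting idea (computeDelta below is a deliberately naive placeholder; you would back it with dez, xdelta or whichever encoder you settle on once its jar is on the classpath):

import org.apache.spark.sql.SparkSession

object BlockDeltas {
  // Naive placeholder "delta": empty when a block is unchanged, otherwise the whole block.
  // Replace with a call into the real delta encoder once it is available.
  def computeDelta(reference: Array[Byte], target: Array[Byte]): Array[Byte] =
    if (java.util.Arrays.equals(reference, target)) Array.emptyByteArray else target

  def main(args: Array[String]): Unit = {
    val spark     = SparkSession.builder.appName("block-deltas").getOrCreate()
    val sc        = spark.sparkContext
    val blockSize = 4 * 1024 * 1024 // 4 MB blocks

    // Cut a file into fixed-size blocks keyed by block index.
    // Note: binaryFiles materialises each file on one executor before splitting,
    // which is fine for a sketch but worth revisiting for really huge inputs.
    def blocks(path: String) =
      sc.binaryFiles(path)
        .flatMap { case (_, stream) => stream.toArray().grouped(blockSize).zipWithIndex }
        .map { case (bytes, index) => (index, bytes) }

    // Pair the blocks of the reference file with those of the other file
    // and compute one delta per pair in parallel across the cluster.
    val deltas = blocks("hdfs:///data/reference.txt")
      .join(blocks("hdfs:///data/other.txt"))
      .mapValues { case (ref, other) => computeDelta(ref, other) }

    deltas.saveAsObjectFile("hdfs:///data/deltas")
    spark.stop()
  }
}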
You could also include the C modules in your project and call them via JNI, the way deep learning frameworks call into native linear algebra routines (a minimal declaration is sketched below). So, in essence, there is a lot to discuss about how to implement these delta algorithms on Spark.
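If you did go down the JNI path, the JVM-side declaration is small; a purely hypothetical binding (library name and signature invented for illustration) could look like:

// Hypothetical JNI binding; "dezdelta" and the signature are made up for illustration.
object NativeDelta {
  System.loadLibrary("dezdelta") // expects libdezdelta.so / .dylib / .dll on java.library.path

  // Implemented on the C/C++ side against the generated JNI header.
  // (Scala 2.12+ accepts a body-less @native method; older compilers want a dummy body.)
  @native def encode(reference: Array[Byte], target: Array[Byte]): Array[Byte]
}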

Related

Importing Java class into a python project

I've been trying to find a way to import Java-ML into my Python project. I have the jar file in the same path as my project.
I want to use it for k-means clustering, since it allows me to change the distance metric. I am also wondering whether, with whichever approach you suggest, I will be able to pass a different Java class as a parameter to the function.
I tried using:
import sys
sys.path.append(r"C:\Users\X\Desktop\X\javaml-0.1.7\javaml-0.1.7.jar")
import net.sf.javaml as jml
test = jml.clustering.Kmeans()
I considered using Jython; however, I am unsure how it works, whether I could keep using IDLE, and whether I would have to reprogram my project.
Lastly, I considered using PyJNIus, but I simply could not get it to work.
In short, you can't run Java code natively in a CPython interpreter.
Firstly, Python is just the name of the specification for the language. If you are using the Python supplied by your operating system (or downloaded from the official Python website), then you are using CPython. CPython does not have the ability to interpret Java code.
However, as you mentioned, there is an implementation of Python for the JVM called Jython, which can therefore interact with Java modules. Relatively few people work with Jython, though, so you will be somewhat on your own in making everything work properly. You would not need to rewrite your vanilla Python code (Jython can interpret Python 2.x), but not all libraries (such as NumPy) are supported.
Finally, I think you need to take a closer look at the K-Means algorithm, as it is implicitly defined in terms of Euclidean distance. Using any other distance metric would no longer be K-Means and may affect the convergence of the algorithm. See here for more information.
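To make that concrete, K-Means alternates between assigning each point to its nearest centroid and recomputing each centroid as the cluster mean, and both steps decrease the same objective only because that objective is the within-cluster sum of squared Euclidean distances:

J = \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2,
\qquad
\mu_i = \frac{1}{\lvert C_i \rvert} \sum_{x \in C_i} x
      = \arg\min_{m} \sum_{x \in C_i} \lVert x - m \rVert^2 .

Under a different metric the mean is generally no longer the minimiser in the update step, so the objective is not guaranteed to decrease, which is exactly why variants such as K-Medoids swap the mean for a medoid.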
Again, you can't run Java code natively in a CPython interpreter. Of course there are various third-party libraries that will handle marshalling data between Java and Python. However, I stand by my statement that for this particular use case you are likely better off using a native Python library (something like K-Medoids in scikit-learn). Attempting to call through to Java, with all the associated overhead, is overkill for this problem, in my opinion.
To "answer" your question directly, Jython will be your best bet if you simply want to import Java classes. Jython strives very hard to be as compatible with Python 2.x as possible and does a good job, so you won't have to spend too much time rewriting code. Simply run it with Jython, see what happens, then modify what breaks.
Now for the Python answer :D. You may want to use scikit-learn for a native implementation. It will certainly be faster than running anything in Jython.
Update
I think the Py4J module is what you're looking for. It works by running a server inside your Java code; the Python side then communicates with that Java server. The main thing "Py4J" buys you is the boilerplate code: you could fairly easily set up your own client/server without any extra modules. However, I still don't think it's a superior option compared to Python's native modules.
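For reference, the JVM half of a Py4J setup is just a py4j.GatewayServer wrapped around whatever object you want to expose (sketched here in Scala, though plain Java is identical; the entry-point class and method are invented for illustration):

import py4j.GatewayServer

// The object Python will call into via gateway.entry_point.
class ClusteringEntryPoint {
  def cluster(k: Int): String = s"would run k-means with k=$k"
}

object ClusteringGateway {
  def main(args: Array[String]): Unit = {
    val server = new GatewayServer(new ClusteringEntryPoint) // listens on port 25333 by default
    server.start()
    println("Gateway server started") // the Python side connects with JavaGateway()
  }
}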
References
How to import Java class w/ Jython
Scikit - K-Means

Flexible jni project

It seems that working with JNI is going to be my everyday routine for the next few months. Are there any tools that simplify dealing with mixed Java + C++ projects?
Is it possible to regenerate the glue *.h files and rebuild the native libraries automatically? Or should I write some scripts for Maven, Ant, Gradle, or something else?
Does anyone have experience with this?
Check out JavaCPP! I also list other solutions on that page... There's also Jace that is useful when trying to use Java from C++.
Some months ago I faced the same questions. It seems that Java/C++ interop is reviving just now, and that you are one of the pioneers.
If you're merely using C++ objects from Java, JNA may be a better solution.
If you're using Java from C++, I haven't yet encountered a mature library. Although functionally quite complete, JNI is a C API (intentionally, if you read the design rationale). If you are about to write a lot of code against it, I think it will pay to write a C++ framework around it that wraps the bare jobject, JNIEnv, jclass... handles into explicit resources.
The real issues arise when the C++ and Java sides have to cooperate through callbacks and the like... Buckle up if that's your intent...
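To illustrate the JNA route mentioned above: binding a plain C entry point needs no generated headers at all. A minimal sketch, assuming JNA 5.x is on the classpath (shown in Scala; a Java interface works the same way):

import com.sun.jna.{Library, Native}

// Declare the C signature you need; JNA builds the proxy at runtime.
trait CLib extends Library {
  def strlen(s: String): Int // size_t strlen(const char *s), narrowed to Int for brevity
}

object JnaDemo extends App {
  val libc: CLib = Native.load("c", classOf[CLib]) // libc on Unix-like systems
  println(libc.strlen("hello")) // prints 5
}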
You are asking about experience. My experience is that you should start with very well designed requirements, behavior and object lifecycles. That should result in a mature interface which will change very little in the future. The effect is that you will rarely need to change the glue header files, and a simple one-shot javah run is good enough. It doesn't sound very agile, I know, but JNI is anything but a rapid development environment.
Changing the interface twice a day, adding and removing methods, and changing signatures "just to see if it helps" is a sure road to hell. You are connecting two worlds that are very different in terms of memory management, and the JVM can get nervous very easily. Thread safety is yet another level up. The helper solutions mentioned, while undoubtedly clever pieces of software, might give you a false impression that JNI is easy. Then the JVM starts throwing exceptions out of nowhere, your objects turn up uninitialized at random, and so on...
You can use SWIG to generate the glue code automatically and add a make target to rebuild the native libraries. You can also use Ant's C++ task for the same purpose.

Existing solution for file deltas/versioning in Java

When versioning files or optimizing backups, one idea is to store only the delta, i.e. the data that has actually been modified.
This sounds like a simple idea at first, but actually determining where unmodified data ends and new data starts turns out to be a difficult task.
Is there an existing framework that already does something like this, or an efficient file comparison algorithm?
XDelta is not Java but is worth looking at anyway. There is a Java version of it, but I don't know how stable it is.
Instead of rolling your own, you might consider leveraging an open-source version control system (e.g., Subversion). You get a lot more than just a delta versioning algorithm that way.
It sounds like you are describing a difference-based storage scheme. Most source control systems use such schemes to minimize their storage requirements. The *nix "diff" command can generate the data you would need to implement one on your own.
Here's a Java library that can compute diffs between two plain text files:
http://code.google.com/p/google-diff-match-patch/
I don't know any library for binary diffs though. Try googling for 'java binary diff' ;-)
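For the plain-text case, driving google-diff-match-patch from the JVM is only a few lines. A sketch (in Scala; the Java port's class is conventionally diff_match_patch, and the package name may differ depending on which copy you grab):

import name.fraser.neil.plaintext.diff_match_patch // adjust the package to your copy of the library

object TextDeltaDemo {
  def main(args: Array[String]): Unit = {
    val dmp = new diff_match_patch()

    val original = "the quick brown fox"
    val modified = "the quick red fox jumps"

    // Build a compact textual patch describing original -> modified ...
    val patches = dmp.patch_make(original, modified)
    println(dmp.patch_toText(patches))

    // ... and apply it back to the original to reconstruct the modified text.
    val result = dmp.patch_apply(patches, original)
    println(result(0)) // the reconstructed text
  }
}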
In my opinion, the bsdiff tool is the best choice for binary files. It uses suffix sorting (Larsson and Sadakane's qsufsort) and takes advantage of how executable files change. bsdiff was written in C by Colin Percival. Diff files created by bsdiff are generally smaller than those created by xdelta.
It is also worth noting that bsdiff uses the bzip2 compression algorithm. Binary patches created by bsdiff can sometimes be compressed further with other algorithms (such as the one used by the WinRAR archiver).
Here is the site where you can find the bsdiff documentation and download bsdiff for free: http://www.daemonology.net/bsdiff/
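bsdiff itself is a native command-line tool, so if you only need it from the JVM, one low-effort option is simply to shell out to it. A hedged sketch, assuming the bsdiff and bspatch binaries are installed and on the PATH:

// Drives the native bsdiff/bspatch executables; both must be on the PATH.
object BsdiffRunner {
  private def run(cmd: String*): Unit = {
    val exit = new ProcessBuilder(cmd: _*).inheritIO().start().waitFor()
    require(exit == 0, s"${cmd.head} failed with exit code $exit")
  }

  // bsdiff <oldfile> <newfile> <patchfile>
  def diff(oldFile: String, newFile: String, patchFile: String): Unit =
    run("bsdiff", oldFile, newFile, patchFile)

  // bspatch <oldfile> <newfile> <patchfile>
  def patch(oldFile: String, newFile: String, patchFile: String): Unit =
    run("bspatch", oldFile, newFile, patchFile)
}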

Using Java from C++

As a C++ developer, I occasionally come across Java libraries like iText, Batik, JasperReports, and JFreeChart. In each case, equivalent cross-platform C++ libraries seem to be much less mature, much more expensive, or unavailable.
Is it practical to use these Java libraries from my C++ app for reporting, charting, and similar? If so, what's the best approach to doing so?
Use JNI to embed a JVM within my application?
Use GCJ to compile the Java libraries to native code?
Some other integration method that I'm not aware of?
Give up, since calling a Java library from C++ would be too hard to be practical, and instead invest my efforts in finding C++ libraries?
The least complicated method of integration is the old-school UNIX approach: launch a small Java program that does the task you need and communicate with it on STDIN/STDOUT.
This may not be possible in all cases, but it definitely is for use cases like PDF, SVG, reporting and charting which largely involve generating single documents for saving or display.
Watch out for log4j, slf4j, JUL, etc. logging if you take this approach! Anything that the Java program writes to standard out could corrupt the document you receive in the C++ program. Disabling logging or using sockets may be better in that case.
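As a rough illustration of the stdin/stdout pipeline, the JVM-side worker can be kept tiny: read a request from standard input, write only the generated document bytes to standard output, and keep every diagnostic message on standard error so it can never corrupt what the C++ parent reads (sketched in Scala; a plain Java main is equivalent, and the render method is a stand-in for the real library call):

object ReportWorker {
  // Stand-in for the real work, e.g. rendering a PDF with one of the Java libraries above.
  def render(request: String): Array[Byte] =
    s"<rendered report for: $request>".getBytes("UTF-8")

  def main(args: Array[String]): Unit = {
    val request = scala.io.StdIn.readLine()        // the request arrives on stdin
    System.err.println(s"rendering '$request'")    // diagnostics stay on stderr
    val document = render(request)
    System.out.write(document, 0, document.length) // only the document goes to stdout
    System.out.flush()
  }
}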

Embedding dendrogram in Java

I'm looking for a library capable of drawing dendrograms of data in Java (not calculating them; I can do that myself). Do you have any clues? I've already searched Google but haven't found anything that isn't stand-alone (and I need to embed the generation inside my program).
Thanks!
Check out the JUNG graph library. It won't perform the actual clustering for you but is a really good library for visualising your results.
Take a look at Archaeopteryx. It has quite a few features, it's open source, and it is available as a pre-packaged jar file.
BTW, I use JUNG and really like it. It can perform various clusterings, but AFAIK, it has no inherent dendrogram capabilities. Because it has graphing capabilities, you could roll your own dendrogram, but it would take some work.
