High performance string hashing function in Java/Scala [closed] - java

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 2 years ago.
Looking for high-performance String hashing functions in Java/Scala: something faster than the functions from the MurmurHash family. It doesn't need to be cryptographically strong; it only needs to distribute well.
Any suggestions?

You can find very fast hash function implementations for Java here: https://github.com/OpenHFT/Zero-Allocation-Hashing. By the way, they take the internal String representation (a char[] array) into account to maximize speed.

The fastest hashing algorithm that fits the bill presently seems to be xxHash. The lz4-java project contains an implementation ported to Java. I don't know whether the Java implementation has been benchmarked against MurmurHash, though; performance optimizations in C++ don't always port to/from Java. (In particular, xxHash contains more array access, so there could be non-negligible bounds-checking overhead.)
Edit: it looks to me like it uses JNI to call the C++ implementation of xxHash, but JNI overhead is non-negligible, so the performance concerns remain.
However, given that Scala includes a MurmurHash function, and that Java contains a faster default hash (about 2x) that is sorta-reasonably distributed sometimes, one does wonder whether it's really necessary. For instance, scala.util.hashing.MurmurHash3 is about as fast as string creation from an array of bytes, and is twice as fast as that if you give it an array of bytes.
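To make the "faster default hash" concrete: String.hashCode() is documented as a polynomial with multiplier 31 over the string's chars, which is why it is cheap and only moderately well distributed. The sketch below recomputes that formula by hand; the class and method names are made up for illustration, and it is not a benchmark.

```java
final class DefaultStringHash {
    // Recomputes the documented String.hashCode() formula:
    // s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1], using wrapping int arithmetic.
    static int polynomialHash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i);
        }
        return h;
    }

    public static void main(String[] args) {
        String key = "hello";
        // Both values are the same int.
        System.out.println(polynomialHash(key) + " " + key.hashCode());
    }
}
```

One multiply and one add per char is hard to beat on speed, so a replacement hash mainly pays off when the distribution of the default turns out to be a problem for your keys.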

Related

Garbage-collected languages with efficient numeric data types [closed]

Closed 4 years ago.
I am searching for a language/library (preferably JVM-based) that handles numeric values (integers and floating-point numbers) in a way that is both convenient and efficient.
Convenient: supported by the collection framework and generics.
Efficient: incurs no noticeable overhead when primitives are the building block of data-heavy processing software (specifically, processing multiple GB of text with >100,000,000 items).
Deficiencies of the current languages:
Plain Java: auto-boxing is quite convenient, but it has substantial overhead.
Scala and Kotlin: both also seem to rely on Java's boxed primitives, so no efficiency advantage here.
Python: again, seems to box all numeric values, and we ran into prohibitive performance problems with vanilla Python. Numpy, which provides a different implementation, does not support the needed features.
Is there a language that handles primitives with the same convenience but efficiently (compared to that language's general performance)?
C# fits the criteria, depending on what you mean by the efficiency requirement. It doesn't run on the JVM, of course.
Unlike Java, which implements generics with type erasure, C# implements generics via reification, specializing the code per value type much as C++ templates do. That means that when you make a List<int>, the underlying array will be an array of int, not an array of objects. Also, the code that implements all the List methods will be compiled specifically for List<int>, and can take advantage of int-specific optimizations.
For this reason, data processing with primitive types is generally faster in C# than it is in Java when you're using all the convenient language features. It can still be far from what you can get with C++, however, because the runtime checks that prevent buffer overrun, etc., are not free.
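On the Java side, erasure and boxing can be observed directly. This is a minimal sketch with made-up class and method names: the two lists share one runtime class, and summing a List<Integer> unboxes every element, while an int[] holds the primitives themselves.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class ErasureDemo {
    // With type erasure, List<Integer> and List<String> share one runtime class.
    static boolean sameRuntimeClass() {
        List<Integer> ints = new ArrayList<>();
        List<String> strings = new ArrayList<>();
        return ints.getClass() == strings.getClass();
    }

    // Each element is an Integer object that is unboxed on every read.
    static long sumBoxed(List<Integer> values) {
        long total = 0;
        for (Integer v : values) {
            total += v;
        }
        return total;
    }

    // The array stores the int values directly, with no per-element object.
    static long sumPrimitive(int[] values) {
        long total = 0;
        for (int v : values) {
            total += v;
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(sameRuntimeClass());
        System.out.println(sumBoxed(Arrays.asList(1, 2, 3)));
        System.out.println(sumPrimitive(new int[] {1, 2, 3}));
    }
}
```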

scipy linkage (cluster) function equivalent in java/scala [closed]

Closed 3 years ago.
I am looking for a library in Java or Scala that can do the same clustering as scipy's linkage does:
Performs hierarchical/agglomerative clustering.
The input y may be either a 1d compressed distance matrix or a 2d array of observation vectors.
If y is a 1d compressed distance matrix, then y must be a $\binom{n}{2}$ sized vector, where n is the number of original observations paired in the distance matrix. The behavior of this function is very similar to the MATLAB linkage function.
The Java libraries I have found (like jblas) are pretty low-level and lack higher-order algorithms like linkage. On the other hand, I am pretty sure there are some libraries that do this. It would be nice if you could point me to one or two.
PS: One can find a lot of individuals implementing some form of hierarchical clustering; I would prefer a more trustworthy library like Commons Math if possible, but there I could only find k-means clustering.
In the end I am using this library: https://github.com/lbehnke/hierarchical-clustering-java
It's not heavily maintained, but it passes the comparison with the Python and MATLAB implementations.
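When feeding a Java library the same input you would give scipy, the 1d compressed distance matrix from the question is the part most easily gotten wrong. For n observations it has n(n-1)/2 entries, one per pair (i, j) with i < j, stored row by row. The sketch below builds it with Euclidean distances; the class and method names are invented for illustration and are not part of any library mentioned here.

```java
final class CondensedDistances {
    // Position of pair (i, j), with i < j, in the condensed vector for n observations.
    static int condensedIndex(int n, int i, int j) {
        return n * i - i * (i + 1) / 2 + (j - i - 1);
    }

    // Builds the condensed vector of pairwise Euclidean distances between rows.
    static double[] pairwiseEuclidean(double[][] points) {
        int n = points.length;
        double[] condensed = new double[n * (n - 1) / 2];
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) {
                double sum = 0.0;
                for (int k = 0; k < points[i].length; k++) {
                    double d = points[i][k] - points[j][k];
                    sum += d * d;
                }
                condensed[condensedIndex(n, i, j)] = Math.sqrt(sum);
            }
        }
        return condensed;
    }

    public static void main(String[] args) {
        double[][] points = {{0, 0}, {3, 4}, {0, 8}};
        // Order of entries: d(0,1), d(0,2), d(1,2).
        System.out.println(java.util.Arrays.toString(pairwiseEuclidean(points)));
    }
}
```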

Fully persistent linked list [closed]

Closed 7 years ago.
Why isn't there any implementation (in C, C++, Java or even Python...) of a fully persistent (not necessarily functional) linked list that has a constant time/space overhead in the number of modifications?
The data structure I have in mind is the one described in this paper:
http://www.cs.cmu.edu/~sleator/papers/Persistence.htm
After a long search on Google I was unable to find even a partially persistent linked list implementation with the overhead cited above.
PS: The definitions of persistence I am speaking about are those described in the following Wikipedia page:
http://en.wikipedia.org/wiki/Persistent_data_structure
EDIT(after the question was put on hold):
I don't think the reason mentioned applies to my question. I am not exactly asking for a recommendation among different available libraries, so there can't be "opinionated answers and spam". My question expresses astonishment that a data structure that is supposed to be great in theory has not been implemented in any of the well-known languages. So before implementing it myself, I asked this question to see if there is an answer like: "It is normal, the data structure X dominates the one you're looking for and that's why it has not been implemented despite its simplicity". Another answer could be "It is not as good as you think since there is a big hidden constant" or "it doesn't do well with the way caches are built nowadays"... I am sorry if my question was not clear enough. I have reworded it to make my request more explicit.
Have you tried the Functional Java library? It has some persistent data structures:
http://www.functionaljava.org/features.html
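For context, the persistence that functional libraries typically offer is the simple kind sketched below: an immutable singly linked list whose versions share their tails. This is a minimal illustration with made-up names, not the node-copying structure from the paper linked in the question, and it only gives O(1) updates at the head.

```java
import java.util.NoSuchElementException;

final class PersistentList<T> {
    private static final PersistentList<Object> EMPTY = new PersistentList<Object>(null, null, 0);

    private final T head;
    private final PersistentList<T> tail;
    private final int size;

    private PersistentList(T head, PersistentList<T> tail, int size) {
        this.head = head;
        this.tail = tail;
        this.size = size;
    }

    // The empty list is a shared immutable sentinel, so the cast is safe.
    @SuppressWarnings("unchecked")
    static <T> PersistentList<T> empty() {
        return (PersistentList<T>) EMPTY;
    }

    // O(1) time and space: the new version reuses this list as its tail.
    PersistentList<T> prepend(T value) {
        return new PersistentList<>(value, this, size + 1);
    }

    T head() {
        if (size == 0) throw new NoSuchElementException("empty list");
        return head;
    }

    PersistentList<T> tail() {
        if (size == 0) throw new NoSuchElementException("empty list");
        return tail;
    }

    int size() {
        return size;
    }
}
```

Every old version stays readable and can be extended again, so two versions built from the same list share it rather than copy it. Modifying the middle of such a list costs a copy of the prefix, which is the overhead the structure in the paper is meant to avoid.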

Fast accurate sparse svd library? [closed]

Closed 7 years ago.
I'm looking for a fast svd library, in either c, c++ or java. Ultimately I'm using Java, but I'm very comfortable using jna to wrap c++, eg http://github.com/hughperkins/jeigen
I'm looking for a fast svd library that will handle sparse matrices. To keep this objective, so that the question doesn't get marked as too subjective, let's say:
targeting use with news20.binary, e.g. from http://mldata.org/repository/data/viewslug/news20binary/
how long does it take to run?
how much variance is conserved, eg for an S matrix of size 6 or 20?
I looked around at a few libraries and found:
matlab: super fast, about 10 seconds, but it's not really a 'library' as such. average squared projection error: 0.93
redsvd: super fast, about 1 second to run, for 6 features, but the average squared projection error is 0.97, which is very high
Eigen's svd is both very slow, and only for dense matrices
svdlibc: ran for 28 minutes before I stopped it; I guess it's calculating the full S, rather than just the first 6 features or so
Basically, I'm looking for a library that gives about the same speed and average squared projection error as matlab, or at least, somewhat comparable.
From my experience, svdlibc is the best library of those options. I've dug a bit through its code before and I don't believe it's calculating the full S matrix (i.e., it is a true "thin svd"). If you can control the matrix representation on disk, svdlibc performs much faster when using the sparse binary input format due to the significantly lower I/O overhead.
The S-Space Package provided an executable jar around the SVDLIBJ java port of SVDLIBC. However, they found it had different results than SVDLIBC for certain input solutions.
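To illustrate why computing only the leading singular values of a sparse matrix can be much cheaper than a full decomposition, here is a minimal power-iteration sketch for the largest singular value. It is not how SVDLIBC or any library above works internally; the coordinate-list input format, the names, and the all-ones start vector are assumptions made for the example. Each iteration only touches the nonzero entries.

```java
import java.util.Arrays;

final class SparsePowerIteration {
    // Estimates the largest singular value of a sparse matrix in coordinate form:
    // entry k has value vals[k] at position (rows[k], cols[k]).
    static double topSingularValue(int nRows, int nCols, int[] rows, int[] cols,
                                   double[] vals, int iterations) {
        double[] v = new double[nCols];
        Arrays.fill(v, 1.0 / Math.sqrt(nCols)); // assumed start vector, unit length
        double sigma = 0.0;
        for (int it = 0; it < iterations; it++) {
            double[] u = new double[nRows];     // u = A v
            for (int k = 0; k < vals.length; k++) {
                u[rows[k]] += vals[k] * v[cols[k]];
            }
            sigma = norm(u);                    // ||A v|| with ||v|| = 1
            double[] w = new double[nCols];     // w = A^T u
            for (int k = 0; k < vals.length; k++) {
                w[cols[k]] += vals[k] * u[rows[k]];
            }
            double wNorm = norm(w);
            if (wNorm == 0.0) {
                return sigma;                   // v is in the null space; stop
            }
            for (int j = 0; j < nCols; j++) {
                v[j] = w[j] / wNorm;
            }
        }
        return sigma;
    }

    private static double norm(double[] x) {
        double sum = 0.0;
        for (double value : x) {
            sum += value * value;
        }
        return Math.sqrt(sum);
    }
}
```

The estimate is poor if the start vector happens to be orthogonal to the leading singular vector, and production libraries use more robust methods, so treat this only as an illustration of the cost argument.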

binary decision diagram [closed]

Closed 6 years ago.
In Java, I have a set of expressions like cond1 AND (cond2 OR cond3) AND (cond4 OR cond5). I would like to convert each one into a tree and then evaluate the final boolean answer. I searched a lot for Java BDD libraries but was not able to find any. Any suggestions with sample code?
A 5-second Google search returned some reasonable-looking results:
JavaBDD
Java Decision Diagram Libraries
What is the best Binary Decision Diagram library for Java?
Is this not what you're looking for?
He means Binary Decision Diagrams.
I've been tinkering with JavaBDD and JBDD/JDD. Both are based on BuDDY (a C library) -- JBDD actually uses the C DLLs for a marginal performance boost.
It looks to me like JavaBDD is more fully-featured (ex. it supports composing BDDs, which is what I need). But there is also no tutorial for it, and while the class docs aren't terrible, frankly I can't figure out how to use it for the most basic of boolean operations (like the problem you pose).
JBDD/JDD requires you to use manual garbage collection, and does weird things like store BDD objects in Java integers -- clearly carry-overs from C. But it has a set of tutorials.
If you want to run your own parser, check out JavaCC.
Here is a nice tutorial to get you started. A bit older, but still valid:
http://www.javaworld.com/jw-12-2000/jw-1229-cooltools.html
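If all you need is to evaluate an expression like the one in the question, a plain expression tree may be enough. The sketch below is not a BDD and does not use JavaBDD or JDD; the interface and helper names are invented for illustration, and the tree is built by hand rather than parsed.

```java
import java.util.HashMap;
import java.util.Map;

final class BoolExprDemo {
    // A node in the expression tree; env maps condition names to their values.
    interface Expr {
        boolean eval(Map<String, Boolean> env);
    }

    // Leaf node: looks up a named condition, treating missing names as false.
    static Expr var(String name) {
        return env -> env.getOrDefault(name, Boolean.FALSE);
    }

    // True only if every child is true; stops at the first false child.
    static Expr and(Expr... parts) {
        return env -> {
            for (Expr p : parts) {
                if (!p.eval(env)) return false;
            }
            return true;
        };
    }

    // True if any child is true; stops at the first true child.
    static Expr or(Expr... parts) {
        return env -> {
            for (Expr p : parts) {
                if (p.eval(env)) return true;
            }
            return false;
        };
    }

    // cond1 AND (cond2 OR cond3) AND (cond4 OR cond5)
    static Expr example() {
        return and(var("cond1"),
                   or(var("cond2"), var("cond3")),
                   or(var("cond4"), var("cond5")));
    }

    public static void main(String[] args) {
        Map<String, Boolean> env = new HashMap<>();
        env.put("cond1", true);
        env.put("cond3", true);
        env.put("cond4", true);
        System.out.println(example().eval(env));
    }
}
```

A BDD becomes worthwhile when you need to compare, combine, or count solutions of many expressions rather than evaluate one of them against given inputs.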
