Do you think it would be possible to implement sparse matrix operations using the new Stream interface in Java 1.8? If so, how would the matrices and the operations need to be implemented? Ultimately, I want to be able to take advantage of the "automatic" parallelization.
It can clearly be done. How about something like the following for a simple SpMV (sparse matrix-vector multiplication), with the sparse matrix stored in coordinate (COO) format, the simplest sparse format out there:
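import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class COO {
    int x, y, value; // x = row index, y = column index
}

public static ArrayList<Integer> spmv(List<COO> values, ArrayList<Integer> v) {
    // assumes a square matrix, so the result has as many entries as v
    final ArrayList<Integer> result = new ArrayList<>(Collections.nCopies(v.size(), 0));
    // sequential only: result.set is not thread-safe, so do not switch this to parallelStream()
    values.stream().forEach(
        coo -> result.set(coo.x, result.get(coo.x) + coo.value * v.get(coo.y))
    );
    return result;
}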
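If the goal is parallel execution, note that the forEach version cannot simply be switched to parallelStream(): several threads would then read and write the same result slots. Here is a minimal sketch of one race-free alternative, assuming int values as above. The names ParallelSpmv, Entry and rowCount are made up for illustration and do not come from any library. It groups the entries by row and lets the collector sum each row's products, so no shared state is mutated inside the stream.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

final class ParallelSpmv {
    // One non-zero entry in coordinate (COO) form: row index, column index, value.
    static final class Entry {
        final int row, col, value;
        Entry(int row, int col, int value) {
            this.row = row;
            this.col = col;
            this.value = value;
        }
    }

    // Computes y = A * v for a matrix with rowCount rows.
    // The collector sums value * v[col] per row; rows without entries stay 0.
    static int[] spmv(List<Entry> entries, int[] v, int rowCount) {
        Map<Integer, Integer> rowSums = entries.parallelStream()
            .collect(Collectors.groupingBy(
                (Entry e) -> e.row,
                Collectors.summingInt((Entry e) -> e.value * v[e.col])));
        int[] result = new int[rowCount];
        // Single-threaded copy of the per-row sums into the result array.
        rowSums.forEach((row, sum) -> result[row] = sum);
        return result;
    }
}
```

Whether this is actually faster than the sequential loop depends on the matrix size and on the boxing overhead of the intermediate map, so measure before relying on it.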
But I sincerely suggest you use something pre-coded, if you don't want to spend the next 3 years of your life understanding the performance implications of sparse matrix operations.
This is quite a large research/optimisation topic and there are many factors to consider like (just off the top of my head):
scheduling / reordering of matrix values to improve cache performance
using an optimal storage format for specific problems (e.g. see this survey on netlib)
There are many implementations out there that can achieve orders-of-magnitude improvements in performance over a hand-crafted implementation. To name a few, check out:
Intel MKL Sparse BLAS
Nvidia's cuSPARSE
I would just write bindings to those if they don't exist already, although something like la4j looks quite promising.
I am trying to convert a very long int[] with length of 1,000,000 to Integer[] so that I can sort it with a custom comparator (which sorts elements based on the length of their corresponding lists in a defined Map<Integer, List<Integer>>).
I have done the following:
private static Integer[] convert(int[] arr) {
    Integer[] ans = new Integer[arr.length];
    for (int i = 0; i < arr.length; i++) {
        ans[i] = arr[i];
    }
    return ans;
}
It works well for me but I have also come across
Integer[] ans = Arrays.stream(intArray).boxed().toArray( Integer[]::new );
and
Integer[] ans = IntStream.of(intArray).boxed().toArray( Integer[]::new );
Is there any of them that is significantly faster than the rest? Or is there any other approach that is fast enough to shorten the run-time?
Is there any of them that is significantly faster than the rest?
You realize the question you're asking is akin to:
"I have 500,000 screws to screw into place. Unfortunately, I can't be bothered to go out and buy a screwdriver. I do have a clawhammer and an old shoe. Should I use the clawhammer to bash these things into place, or is the shoe a better option?"
The answer is clearly: Uh, neither. Go get a screwdriver, please.
To put it differently, if the 'cost' of converting to Integer[] first is 1000 points of cost in some arbitrary unit, then the difference in the options you listed is probably between 0.01 and 0.05 points - i.e. dwarfed so much, it's irrelevant. Thus, the direct answer to your question? It just does not matter.
You have 2 options:
Performance is completely irrelevant. In which case this is fine, and there's absolutely no point to actually answering this question.
You care about performance quite a bit. In which case, this Integer[] plan needs to be off the table.
Assuming you might be intrigued by option 2, you have various options.
The easiest one is to enjoy the extensive Java ecosystem. Someone's been here before and made an excellent class just for this purpose. It abstracts the concept of an int array and gives you all sorts of useful methods, including sorting, and the team that made it is extremely concerned about performance, so they put in the many, many, many person-weeks it takes to do proper performance analysis (between HotSpot, pipelining CPUs, and today's complex OSes, it's much harder than you might think!).
Thus, I present you: IntArrayList. It has a .sortThis() method, as well as a .sortThis(IntComparator c) method, which you can use for sorting purposes.
There are a few others out there; searching the web for 'java primitive collections' should find them all, if for some reason the excellent Eclipse Collections project isn't to your liking (NB: You don't need Eclipse-the-IDE to use it. It's a general-purpose library that just so happens to be maintained by the Eclipse team).
If you must hand-roll it, quicksort implementations in Java are easy to find on the web, so you can easily write your own 'sort this int array for me' code. Not that I would reinvent that particular wheel. Just pointing out that it's not too difficult if you must.
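If you want to stay within the JDK, here is one possible sketch. Note that it is a different technique from writing your own quicksort: it packs each sort key together with its value into a long and sorts the resulting long[] with Arrays.sort. KeyedIntSort and sortByListSize are made-up names, and it assumes (as in the question) that the sort key is the size of the value's list in the map, with values missing from the map counting as size 0.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;

final class KeyedIntSort {
    // Sorts 'values' in place by the size of each value's list in 'lists'.
    // Ties are broken by the value itself, compared as an unsigned 32-bit number.
    static void sortByListSize(int[] values, Map<Integer, List<Integer>> lists) {
        long[] packed = new long[values.length];
        for (int i = 0; i < values.length; i++) {
            // The map lookup still autoboxes the key, but no Integer[] is built.
            List<Integer> list = lists.get(values[i]);
            long key = (list == null) ? 0 : list.size();
            // High 32 bits: sort key. Low 32 bits: the original value.
            packed[i] = (key << 32) | (values[i] & 0xFFFFFFFFL);
        }
        Arrays.sort(packed); // primitive sort, no comparator calls
        for (int i = 0; i < values.length; i++) {
            values[i] = (int) packed[i]; // the low 32 bits are the original value
        }
    }
}
```

The key is computed once per element instead of once per comparison, which is usually where a comparator-based sort on Integer[] spends its time.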
I have a large dataset (>500,000 elements) that contains the stress values (σ_xx, σ_yy, σ_zz, τ_xy, τ_yz, τ_xz) of FEM elements. These stress values are given in the global xyz coordinate space of the model. I want to calculate the principal (main-axis) stress values and directions from those. If you're not that familiar with the physics behind it, this means taking the symmetric matrix
| σ_xx τ_xy τ_xz |
| τ_xy σ_yy τ_yz |
| τ_xz τ_yz σ_zz |
and calculating its eigenvalues and eigenvectors. Calculating each set of eigenvalues and -vectors on its own is too slow. I'm looking for a library, an algorithm or something in Java that would allow me to do this as array calculations. As an example, in python/numpy I could just take all my 3x3-matrices, stack them along a third dimension to get a nx3x3-array, and pass that to np.linalg.eig(arr), and it automatically gives me an nx3-array for the three eigenvalues and an nx3x3-array for the three eigenvectors.
Things I tried:
nd4j has an Eigen-module for calculating eigenvalues and -vectors, but only supports a single square array at a time.
Calculate the characteristic polynomial and use Cardano's formula to get the roots/eigenvalues. This is possible to do for the whole array at once, but I'm now stuck on how to get the corresponding eigenvectors. Is there perhaps a simple general algorithm to get from the eigenvalues to the eigenvectors?
Looking for an analytical form of the eigenvalues and -vectors that can be calculated directly: It does exist, but just no.
You'll need to write a little code.
I'd create or use a Matrix class as a dependency and find methods to give you eigenvalues and eigenvectors. The ones you found in nd4j sound like great candidates. You might also consider the Linear Algebra For Java (LA4J) dependency.
Load the dataset into a List<Matrix>.
Use functional Java methods to apply a map to give you a List of eigenvalues as a vector per stress matrix and a List of eigenvectors as a matrix per stress matrix.
You can speed this calculation up by applying the map function to a stream. If you make it a parallel stream (parallelStream() or .parallel()), Java will spread the calculation across the available cores under the covers.
Follow-up: This is the way that worked best for me, as I can do all operations without iterating over every element. As stated above, I'm using Nd4j, which seems to be limited in its possibilities compared to numpy (or maybe I just didn't read the documentation thoroughly enough). The following method uses only basic array operations:
From the given stress values, calculate the eigenvalues using Cardano's formula. Only element wise instructions are needed to do that (add, sub, mul, div, pow). The result should be three vectors of size n, each containing one eigenvalue for all elements.
Use the formula given here to calculate the matrix S for each eigenvalue. Like step 1, this can also be done using only element-wise operations on the stress-value and eigenvalue vectors, which avoids having to specify complicated instructions about which array to multiply along which axis while preserving the other axes.
Take one column from S and normalize it to get a normalized eigenvector for the given eigenvalue.
Note that this method only works if you have a real symmetric matrix. You also should make sure to properly deal with cases where the same eigenvalue appears multiple times.
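For a single matrix, the three steps can be sketched in plain Java as below. This is only an illustration under assumptions: it uses the trigonometric form of Cardano's formula, and it assumes that the matrix S meant above is the product (A - λ2·I)(A - λ3·I), whose non-zero columns are eigenvectors for λ1 as long as λ1 differs from the other two eigenvalues. Sym3Eigen and its method names are hypothetical. Every operation is plain arithmetic on matrix entries, so the same expressions can be applied element-wise to whole arrays.

```java
final class Sym3Eigen {
    // Eigenvalues of a real symmetric 3x3 matrix, in descending order.
    static double[] eigenvalues(double[][] a) {
        double p1 = a[0][1] * a[0][1] + a[0][2] * a[0][2] + a[1][2] * a[1][2];
        double q = (a[0][0] + a[1][1] + a[2][2]) / 3.0;
        double d0 = a[0][0] - q, d1 = a[1][1] - q, d2 = a[2][2] - q;
        double p = Math.sqrt((d0 * d0 + d1 * d1 + d2 * d2 + 2.0 * p1) / 6.0);
        if (p == 0.0) {
            return new double[] {q, q, q}; // A is a multiple of the identity
        }
        // b = (A - q*I) / p, then r = det(b) / 2, clamped against rounding error
        double[][] b = new double[3][3];
        for (int i = 0; i < 3; i++) {
            for (int j = 0; j < 3; j++) {
                b[i][j] = (a[i][j] - (i == j ? q : 0.0)) / p;
            }
        }
        double det = b[0][0] * (b[1][1] * b[2][2] - b[1][2] * b[2][1])
                   - b[0][1] * (b[1][0] * b[2][2] - b[1][2] * b[2][0])
                   + b[0][2] * (b[1][0] * b[2][1] - b[1][1] * b[2][0]);
        double r = Math.max(-1.0, Math.min(1.0, det / 2.0));
        double phi = Math.acos(r) / 3.0;
        double e1 = q + 2.0 * p * Math.cos(phi);
        double e3 = q + 2.0 * p * Math.cos(phi + 2.0 * Math.PI / 3.0);
        double e2 = 3.0 * q - e1 - e3; // the trace equals e1 + e2 + e3
        return new double[] {e1, e2, e3};
    }

    // Unit eigenvector for the eigenvalue that is NOT passed in, given the other two.
    // Builds S = (A - other1*I)(A - other2*I) and normalizes its largest column.
    static double[] eigenvector(double[][] a, double other1, double other2) {
        double[][] s = new double[3][3];
        for (int i = 0; i < 3; i++) {
            for (int j = 0; j < 3; j++) {
                for (int k = 0; k < 3; k++) {
                    s[i][j] += (a[i][k] - (i == k ? other1 : 0.0))
                             * (a[k][j] - (k == j ? other2 : 0.0));
                }
            }
        }
        int best = 0;
        double bestNorm = -1.0;
        for (int j = 0; j < 3; j++) {
            double n = Math.sqrt(s[0][j] * s[0][j] + s[1][j] * s[1][j] + s[2][j] * s[2][j]);
            if (n > bestNorm) {
                bestNorm = n;
                best = j;
            }
        }
        if (bestNorm == 0.0) {
            // Only catches the exact case; near-repeated eigenvalues need separate handling.
            throw new ArithmeticException("repeated eigenvalue: S is zero");
        }
        return new double[] {s[0][best] / bestNorm, s[1][best] / bestNorm, s[2][best] / bestNorm};
    }
}
```

For example, with e = Sym3Eigen.eigenvalues(a), the eigenvector belonging to e[0] is Sym3Eigen.eigenvector(a, e[1], e[2]).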
I need to efficiently find the ratio of (intersection size / union size) for pairs of Lists of strings. The lists are small (mostly about 3 to 10 items), but I have a huge number of them (~300K) and have to do this on every pair, so I need this actual computation to be as efficient as possible. The strings themselves are short unicode strings -- averaging around 5-10 unicode characters.
The accepted answer here Efficiently compute Intersection of two Sets in Java? looked extremely helpful but (likely because my sets are small (?)) I haven't gotten much improvement by using the approach suggested in the accepted answer.
Here's what I have so far:
protected double uuEdgeWeight(UVertex u1, UVertex u2) {
    Set<String> u1Tokens = new HashSet<String>(u1.getTokenlist());
    List<String> u2Tokens = u2.getTokenlist();
    int intersection = 0;
    int union = u1Tokens.size();
    for (String s : u2Tokens) {
        if (u1Tokens.contains(s)) {
            intersection++;
        } else {
            union++;
        }
    }
    return ((double) intersection / union);
}
My question is, is there anything I can do to improve this, given that I'm working with Strings which may be more time consuming to check equality than other data types.
I think that, because I'm comparing multiple u2's against the same u1, I could get some improvement by building the HashSet for u1 outside of the loop (which isn't shown; I'd pass in the HashSet instead of the object from which I pull the list and then clone it into a set).
Anything else I can do to squeak out even a small improvement here?
Thanks in advance!
Update
I've updated the numeric specifics of my problem above. Also, due to the nature of the data, most (90%?) of the intersections are going to be empty. My initial attempt cloned one set and called retainAll with the other set to find the intersection, and shortcut out if it was empty before doing the clone and addAll needed to find the union. That was about as efficient as the code posted above, presumably because of the trade-off between it being a slower algorithm overall and being able to shortcut out much of the time. So I'm thinking about ways to take advantage of the infrequency of overlapping sets, and would appreciate any suggestions in that regard.
Thanks in advance!
You would get a large improvement by moving the HashSet outside of the loop.
If the HashSet really has only a few entries in it, then an array is probably just as fast, since traversing an array is much simpler/faster. I'm not sure where the threshold lies, but I'd measure both, and be sure that you do the measurements correctly (i.e. warm-up loops before timed loops, etc.).
One thing to try might be using a sorted array for the things to compare against. Scan until you go past current and you can immediately abort the search. That will improve processor branch prediction and reduce the number of comparisons a bit.
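A rough sketch of the sorted-array idea, taken one step further: if both token lists are kept as sorted arrays without duplicates, the intersection can be counted in a single merge-style pass with no hashing at all. SortedJaccard is a made-up name, and the sorting and de-duplication are assumed to have been done once per vertex beforehand.

```java
final class SortedJaccard {
    // Returns |A ∩ B| / |A ∪ B| for two sorted, duplicate-free arrays.
    static double jaccard(String[] a, String[] b) {
        int i = 0, j = 0, intersection = 0;
        while (i < a.length && j < b.length) {
            int c = a[i].compareTo(b[j]);
            if (c == 0) {        // same token in both arrays
                intersection++;
                i++;
                j++;
            } else if (c < 0) {  // a[i] cannot appear later in b
                i++;
            } else {             // b[j] cannot appear later in a
                j++;
            }
        }
        int union = a.length + b.length - intersection;
        return union == 0 ? 0.0 : (double) intersection / union;
    }
}
```

Whether this beats the HashSet version for lists of 3 to 10 items is something only a proper benchmark on your data can tell.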
If you want to optimize for this function (not sure if it actually works in your context), you could assign each unique String an int value; when the String is added to the UVertex, set the bit at that index in a BitSet.
This function then becomes a set.or(otherset) and a set.and(otherset), each done on a copy since both calls modify the receiver, followed by cardinality() to count the bits. Depending on the number of unique Strings, that could be efficient.
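A minimal sketch of the BitSet idea, assuming the token ids are handed out by a shared dictionary (BitSetJaccard and encode are made-up names). Since most intersections are reported to be empty, it checks intersects() first, which avoids any copying in the common case, and it derives the union size from the two cardinalities instead of calling or().

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class BitSetJaccard {
    // Dictionary mapping each distinct token to a small integer id.
    private final Map<String, Integer> ids = new HashMap<>();

    // Encodes a token list as a BitSet, assigning each new token the next free id.
    BitSet encode(List<String> tokens) {
        BitSet bits = new BitSet();
        for (String token : tokens) {
            Integer id = ids.get(token);
            if (id == null) {
                id = ids.size();
                ids.put(token, id);
            }
            bits.set(id);
        }
        return bits;
    }

    // Returns |A ∩ B| / |A ∪ B|; 0.0 when the sets do not overlap.
    static double jaccard(BitSet a, BitSet b) {
        if (!a.intersects(b)) {
            return 0.0; // cheap exit, no allocation
        }
        BitSet both = (BitSet) a.clone(); // and() modifies the receiver
        both.and(b);
        int intersection = both.cardinality();
        int union = a.cardinality() + b.cardinality() - intersection;
        return (double) intersection / union;
    }
}
```

Each vertex would be encoded once up front, so the pairwise loop only touches BitSets. With a very large number of distinct tokens the bit sets become long and sparse, in which case this may lose to the other approaches.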
I am looking for a Java library that closely mirrors MATLAB's matrix functions and possibly other functions in areas such as polynomial interpolation.
If such a library does not exist, I was toying with the idea of building my own, using an existing matrix or scientific computing library to do the heavy lifting. If I were to do that, which libraries would be candidates to serve as backends for such an effort?
Eigen, one of the most widely used (and fastest) libraries for matrix computation in C++, has a Java wrapper: jeigen.
It lets you manipulate dense and sparse matrices and perform operations on them. It may also be worth trying.
Check out the following resources/packages
http://math.nist.gov/javanumerics/jama/
http://www.jscience.org/
Try to look at la4j (Linear Algebra for Java). It supports dense matrices as well as sparse ones. Here is just a brief example of using functional features of la4j:
// reads the dense matrix from the CSV file
Matrix a = new Basic2DMatrix(Matrices.asSymbolSeparatedSource("matrix.csv", ","));
// calculates the sum of all elements of the matrix 'a'
double sum = a.fold(Matrices.asSumAccumulator(0));
// creates a new matrix 'b', that contains elements of matrix 'a' multiplied by '2'.
Matrix b = a.transform(Matrices.asMulFunction(2));
The best way to get the latest version of la4j is to visit its GitHub page.
I use the Colt library for matrix operations.
See in: http://acs.lbl.gov/software/colt/api/index.html
I think it's really good and easy to use and is better than Apache Commons-Math and EJML that I have already tried.
I suggest you try all of the libraries mentioned and choose the one that is closer to your needs.
Which is the best way to implement a sparse vector in Java?
Of course the good thing would be to have something that can be manipulated quite easily (normalization, scalar product and so on)
Thanks in advance
MTJ has a Sparse Vector class. It has norm functions (1-norm, 2-norm and ∞-norm) and dot product functions.
JScience has a SparseVector implementation that is part of its linear algebra package.
You can also take a look at la4j's CompressedVector implementation. It uses a pair of arrays: an array of values and an array of their indices. With binary search on top of that, it just flies. This implementation guarantees O(log n) running time for get/set operations.
Just a brief example
Vector a = new CompressedVector(new double[]{ 1.0, 2.0, 3.0 });
// calculates L_1 norm of the vector
double n = a.norm();
// calculates the sum of vectors elements
double s = a.fold(Vectors.asSumAccumulator(0.0));
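To make the pair-of-arrays idea concrete, here is a small self-contained sketch of such a structure. It is not la4j's actual code: SparseVec and its methods are made up for illustration, and it is read-only to keep it short. Lookups use binary search over the sorted index array, and the dot product walks both index arrays in step.

```java
import java.util.Arrays;

final class SparseVec {
    private final int[] indices;   // sorted positions of the non-zero entries
    private final double[] values; // values[k] is the entry at position indices[k]

    // Assumes sortedIndices is strictly increasing and has the same length as values.
    SparseVec(int[] sortedIndices, double[] values) {
        this.indices = sortedIndices.clone();
        this.values = values.clone();
    }

    // O(log n) lookup; positions that are not stored are zero.
    double get(int i) {
        int k = Arrays.binarySearch(indices, i);
        return k >= 0 ? values[k] : 0.0;
    }

    // Scalar product: only positions stored in both vectors contribute.
    double dot(SparseVec other) {
        double sum = 0.0;
        int i = 0, j = 0;
        while (i < indices.length && j < other.indices.length) {
            if (indices[i] == other.indices[j]) {
                sum += values[i++] * other.values[j++];
            } else if (indices[i] < other.indices[j]) {
                i++;
            } else {
                j++;
            }
        }
        return sum;
    }

    // Euclidean (2-)norm.
    double norm() {
        return Math.sqrt(dot(this));
    }
}
```

Normalization then amounts to dividing every stored value by norm(). A mutable version would also need to shift both arrays whenever a new non-zero entry is inserted.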