Difference between RowMatrix and Matrix in Apache Spark? - java

I want to know the basic difference between RowMatrix and Matrix class available in Apache Spark.

A slightly more precise question would be: what is the difference between mllib.linalg.Matrix and mllib.linalg.distributed.DistributedMatrix?
Matrix is a trait that represents local matrices, which reside in the memory of a single machine. For now there are two basic implementations: DenseMatrix and SparseMatrix.
DistributedMatrix is a trait that represents distributed matrices built on top of RDDs. RowMatrix is a subclass of DistributedMatrix that stores data row-wise, without meaningful row ordering. There are other implementations of DistributedMatrix (such as IndexedRowMatrix, CoordinateMatrix and BlockMatrix), each with its own storage strategy and specific set of methods. See, for example, Matrix Multiplication in Apache Spark.

This is going to come down a little to the idioms of the language / framework / discipline you're using, but in computer science, an array is a one-dimensional "list" of "things" that can be referenced by their position in the list. One of the things that can be in the list is another array, which lets you make arrays of arrays (of arrays of arrays ...), giving you a data set of arbitrarily high dimension.
A matrix comes from linear algebra and is a two-dimensional representation of data (which can be represented by an array of arrays) that comes with a powerful set of mathematical operations for manipulating the data in interesting ways. While arrays can vary in size, the width and height of a matrix are generally known in advance, based on the specific operations you're going to perform.
Matrices are used extensively in 3D graphics and physics engines because they are a fast, convenient way of representing transformation and acceleration data for objects in three dimensions.
Array: a collection of homogeneous elements.
Matrix: a simple arrangement of rows and columns.
They are different things in different domains.
But in computer programming, a collection of single-dimensional arrays can be termed a matrix.
You can represent a 2D array (i.e., a collection of single-dimensional arrays) in matrix form.
Example
A[2][3]: this means A is a collection of 2 single-dimensional arrays, each of size 3.
A[1,1] A[1,2] A[1,3] // This is a single-dimensional array
A[2,1] A[2,2] A[2,3] // This is another single-dimensional array
// The collection is a multi-dimensional, or 2D, array.
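The same idea can be sketched in Java, where a 2D array is literally an array whose elements are 1D arrays. The class name ArrayOfArrays and the cell values are illustrative only, and Java indexes from 0 rather than from 1 as in the notation above.

```java
import java.util.Arrays;

// Illustrative only: a 2 x 3 "matrix" stored as an array of two 1D arrays.
class ArrayOfArrays {

    static int[][] build() {
        return new int[][] {
            {11, 12, 13},   // first single-dimensional array (row index 0)
            {21, 22, 23}    // second single-dimensional array (row index 1)
        };
    }

    public static void main(String[] args) {
        int[][] a = build();
        System.out.println(a.length);               // number of inner arrays: 2
        System.out.println(a[0].length);            // size of each inner array: 3
        System.out.println(Arrays.toString(a[1]));  // the second inner array
    }
}
```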

Related

Calculating the principal axis/eigenvalues and -vectors of large dataset in Java

I have a large dataset (>500,000 elements) that contains the stress values (σ_xx, σ_yy, σ_zz, τ_xy, τ_yz, τ_xz) of FEM elements. These stress values are given in the global xyz coordinate space of the model. I want to calculate the principal stress values and directions from them. If you're not that familiar with the physics behind it, this means taking the symmetric matrix
| σ_xx τ_xy τ_xz |
| τ_xy σ_yy τ_yz |
| τ_xz τ_yz σ_zz |
and calculating its eigenvalues and eigenvectors. Calculating each set of eigenvalues and eigenvectors on its own is too slow. I'm looking for a library, an algorithm, or anything else in Java that would let me do this as array calculations. As an example, in Python/NumPy I could just take all my 3x3 matrices, stack them along a third dimension to get an nx3x3 array, and pass that to np.linalg.eig(arr), which automatically gives me an nx3 array of the eigenvalues and an nx3x3 array of the eigenvectors.
Things I tried:
nd4j has an Eigen module for calculating eigenvalues and eigenvectors, but it only supports a single square array at a time.
Calculating the characteristic polynomial and using Cardano's formula to get the roots/eigenvalues: this is possible for the whole array at once, but I'm now stuck on how to get the corresponding eigenvectors. Is there perhaps a simple general algorithm to get from the eigenvalues to the eigenvectors?
Looking for an analytical form of the eigenvalues and eigenvectors that can be calculated directly: it does exist, but just no.
You'll need to write a little code.
I'd create or use a Matrix class as a dependency and find methods to give you eigenvalues and eigenvectors. The ones you found in nd4j sound like great candidates. You might also consider the Linear Algebra For Java (LA4J) dependency.
Load the dataset into a List<Matrix>.
Use Java's functional methods to apply a map that gives you a List of eigenvalue vectors (one per stress matrix) and a List of eigenvector matrices (one per stress matrix).
You can speed this calculation up by applying the map function to a parallel stream. Java will then spread the work across the available cores under the covers.
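A minimal sketch of the stream-based approach, assuming each stress matrix is held as a double[3][3]. The names StressBatch, mapAll and trace are hypothetical. The trace is only a dependency-free stand-in for the per-matrix solver; in practice that function would call the eigen-decomposition of whichever library is chosen. Note that the stream has to be requested as parallel (parallelStream) for the work to be distributed over cores.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical helper: applies a per-matrix solver to a whole batch of matrices.
class StressBatch {

    // Runs the solver on every matrix using a parallel stream.
    // The result list keeps the same order as the input list.
    static <R> List<R> mapAll(List<double[][]> matrices, Function<double[][], R> solver) {
        return matrices.parallelStream().map(solver).collect(Collectors.toList());
    }

    // Stand-in solver (assumption): the trace, i.e. the sum of the eigenvalues.
    static double trace(double[][] m) {
        return m[0][0] + m[1][1] + m[2][2];
    }

    public static void main(String[] args) {
        List<double[][]> stresses = Arrays.asList(
            new double[][] {{2, 1, 0}, {1, 2, 0}, {0, 0, 5}},
            new double[][] {{1, 0, 0}, {0, 1, 0}, {0, 0, 1}});
        List<Double> traces = mapAll(stresses, StressBatch::trace);
        System.out.println(traces);
    }
}
```

Whether this beats a plain loop depends on how expensive the per-matrix solver is; for very cheap solvers the parallel overhead can dominate.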
Follow-up: This is the way that worked best for me, as I can do all operations without iterating over every element. As stated above, I'm using Nd4j, which seems to be limited in its possibilities compared to numpy (or maybe I just didn't read the documentation thoroughly enough). The following method uses only basic array operations:
1. From the given stress values, calculate the eigenvalues using Cardano's formula. Only element-wise instructions are needed to do that (add, sub, mul, div, pow). The result should be three vectors of size n, each containing one eigenvalue for all elements.
2. Use the formula given here to calculate the matrix S for each eigenvalue. Like step 1, this can also be done using only element-wise operations on the stress-value and eigenvalue vectors, which avoids having to specify complicated instructions about which array to multiply along which axis while keeping the other axes.
3. Take one column from S and normalize it to get a normalized eigenvector for the given eigenvalue.
Note that this method only works if you have a real symmetric matrix. You should also make sure to deal properly with cases where the same eigenvalue appears multiple times.
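The steps above can be sketched for a single matrix as follows. This is not the poster's Nd4j code: it is a plain-Java illustration that uses the trigonometric form of Cardano's formula, and it assumes that S for eigenvalue λ_i is the product (A − λ_j I)(A − λ_k I) over the other two eigenvalues, whose columns are eigenvectors for λ_i by the Cayley-Hamilton theorem. The class name, method names and the tolerance are hypothetical. The same arithmetic can be applied element-wise to length-n vectors of stress components.

```java
// Hypothetical helper: closed-form eigen-decomposition of one real symmetric 3x3 matrix.
class SymmetricEigen3 {

    // Eigenvalues in descending order, via the trigonometric form of Cardano's formula.
    static double[] eigenvalues(double[][] a) {
        double q = (a[0][0] + a[1][1] + a[2][2]) / 3.0;           // mean of the diagonal
        double d0 = a[0][0] - q, d1 = a[1][1] - q, d2 = a[2][2] - q;
        double off = a[0][1] * a[0][1] + a[0][2] * a[0][2] + a[1][2] * a[1][2];
        double p = Math.sqrt((d0 * d0 + d1 * d1 + d2 * d2 + 2.0 * off) / 6.0);
        if (p == 0.0) {
            return new double[] {q, q, q};                        // multiple of the identity
        }
        // det(A - qI), expanded along the first row
        double det = d0 * (d1 * d2 - a[1][2] * a[1][2])
                   - a[0][1] * (a[0][1] * d2 - a[1][2] * a[0][2])
                   + a[0][2] * (a[0][1] * a[1][2] - d1 * a[0][2]);
        // r = det((A - qI) / p) / 2, clamped to [-1, 1] to absorb rounding error
        double r = Math.max(-1.0, Math.min(1.0, det / (2.0 * p * p * p)));
        double phi = Math.acos(r) / 3.0;
        double e1 = q + 2.0 * p * Math.cos(phi);
        double e3 = q + 2.0 * p * Math.cos(phi + 2.0 * Math.PI / 3.0);
        return new double[] {e1, 3.0 * q - e1 - e3, e3};          // trace = e1 + e2 + e3
    }

    // Unit eigenvector for ev[i]: a normalized column of S = (A - ev[j] I)(A - ev[k] I),
    // where j and k are the two remaining indices.
    static double[] eigenvector(double[][] a, double[] ev, int i) {
        double lj = ev[(i + 1) % 3], lk = ev[(i + 2) % 3];
        double[][] s = new double[3][3];
        for (int r = 0; r < 3; r++) {
            for (int c = 0; c < 3; c++) {
                for (int m = 0; m < 3; m++) {
                    double left = a[r][m] - (r == m ? lj : 0.0);
                    double right = a[m][c] - (m == c ? lk : 0.0);
                    s[r][c] += left * right;
                }
            }
        }
        int best = 0;
        double bestNorm = -1.0;
        for (int c = 0; c < 3; c++) {                             // pick the longest column
            double norm = Math.sqrt(s[0][c] * s[0][c] + s[1][c] * s[1][c] + s[2][c] * s[2][c]);
            if (norm > bestNorm) { bestNorm = norm; best = c; }
        }
        if (bestNorm < 1e-12) {                                   // arbitrary tolerance (assumption)
            throw new ArithmeticException("repeated eigenvalue: eigenvector is not unique");
        }
        return new double[] {s[0][best] / bestNorm, s[1][best] / bestNorm, s[2][best] / bestNorm};
    }

    public static void main(String[] args) {
        double[][] stress = {{2, 1, 0}, {1, 2, 0}, {0, 0, 5}};
        double[] ev = eigenvalues(stress);
        for (int i = 0; i < 3; i++) {
            System.out.println(ev[i] + " -> " + java.util.Arrays.toString(eigenvector(stress, ev, i)));
        }
    }
}
```

When an eigenvalue is repeated, S collapses to (numerically) zero for it, which is why the sketch throws instead of returning an arbitrary direction; a real implementation would need to pick a basis of the eigenspace in that case.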

How do you declare and create a matrix of 3 rows x 5 columns that contains values of Double types in Java?

Some people tell me it's the first option, but other people tell me it's the second one.
Where do the rows and where do the columns actually go?
I'd appreciate the help, thanks.
1) Double array[][] = new Double[5][3];
2) Double array[][] = new Double[3][5];
This question usually comes down to which convention you prefer, or which one is predominantly used in your field or, more narrowly, your programming language.
I found this answer stating that Java is "row major". This is the convention I always follow while working with 2D arrays in Java. However, the Wikipedia article on row- and column-major order refers to the part of the Java Language Specification stating that this language is neither row-major nor column-major. Instead, it uses Iliffe vectors to store multi-dimensional arrays, meaning that data in the same row is stored contiguously in memory, but the rows themselves are not. The address of the first element of each row is stored in an array of pointers.
Although it's impossible to classify Java's memory layout as strictly row- or column-major, the use of Iliffe vectors suggests treating it as row-major. Therefore, in order to create a matrix of 3 rows and 5 columns, you should use:
Double array[][] = new Double[3][5];
There is no real concept of matrices in Java. What you're referring to is a two-dimensional array, or in other words an array of arrays.
Writing Double[5][3] will create an array of length 5, containing arrays of length 3 and your other example will do the opposite. Therefore the answer to your question depends on how you want to visualise it. The most obvious way for me is to say that each inner array represents a row in the matrix, therefore I would lean towards Double[3][5], then indexing with a row and column would look like array[row][column] which makes a lot of sense.
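A short sketch of the array[row][column] convention described above. The class name Matrix3x5 is illustrative. One detail worth noting: with the wrapper type Double every cell starts out as null, whereas a primitive double[3][5] would start out filled with 0.0.

```java
// Illustrative only: a 3-row, 5-column matrix as an array of 3 inner arrays of length 5.
class Matrix3x5 {

    static Double[][] create() {
        Double[][] m = new Double[3][5];          // outer length 3 (rows), inner length 5 (columns)
        for (int row = 0; row < m.length; row++) {
            for (int col = 0; col < m[row].length; col++) {
                m[row][col] = 0.0;                // replace the initial null with a value
            }
        }
        return m;
    }

    public static void main(String[] args) {
        Double[][] m = create();
        System.out.println("rows = " + m.length + ", columns = " + m[0].length);
        m[2][4] = 1.5;                            // last row, last column
        System.out.println(m[2][4]);
    }
}
```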

standardising an array of vectors

I have an ArrayList of lists of vectors.
Each vector has three elements, and each list holds an undefined number of vectors, i.e.
x,y,z
x1,y1,z1
....N....
xN,yN,zN
But each list I have varies in length N. I was wondering what a standard approach would be for making them all the same length N. Initially I tried a sparse sampling approach, but that didn't work: I lose a lot of data, and I need to keep as much data as possible. Are there any other methods?

Implementing Adaboost for multiple dimensions in Java

I'm working on AdaBoost implementation in Java.
It should work for "double" coordinates in 2D, 3D or 10D.
Everything I found for Java is for binary data (0, 1) and not for multi-dimensional space.
I'm currently looking for a way to represent the dimensions and to initialize the classifiers for boosting.
I'm looking for suggestions on how to represent the multidimensional space in Java, and how to initialize the classifiers to begin with.
The data lies somewhere in [-15, +15], and the target values are 1 or 2.
To use a boosted decision tree on spatial data, the typical approach is to try to find a "partition point" on some axis that minimizes the residual information in the two subtrees. To do this, you find some value along some axis (say, the x axis) and then split the data points into two groups - one group of points whose x coordinate is below that split point, and one group of points whose x coordinate is above that split point. That way, you convert the real-valued spatial data into 0/1 data - the 0 values are the ones below the split point, and the 1 values are the ones above the split point. The algorithm is thus identical to AdaBoost, except that when choosing the axis to split on, you also have to consider potential splitting points.
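The split search described above can be sketched as a weighted decision stump, the usual weak learner for AdaBoost. This is a brute-force illustration, not JBoost or any library API: the class DecisionStump and its fields are hypothetical, every observed coordinate is tried as a split point, and the labels are assumed to have been mapped from the question's 1/2 to -1/+1.

```java
// Hypothetical weak learner: one axis, one threshold, one orientation.
class DecisionStump {
    final int axis;          // which coordinate is compared
    final double threshold;  // the split point on that axis
    final int polarity;      // label (+1 or -1) given to points at or above the threshold
    final double error;      // weighted training error of this stump

    DecisionStump(int axis, double threshold, int polarity, double error) {
        this.axis = axis;
        this.threshold = threshold;
        this.polarity = polarity;
        this.error = error;
    }

    int predict(double[] point) {
        return point[axis] < threshold ? -polarity : polarity;
    }

    // Tries every axis, every observed coordinate as threshold, and both polarities,
    // and keeps the stump with the smallest weighted error. Cost is O(d * n^2).
    static DecisionStump train(double[][] x, int[] y, double[] w) {
        DecisionStump best = null;
        int dims = x[0].length;
        for (int axis = 0; axis < dims; axis++) {
            for (double[] candidate : x) {
                double t = candidate[axis];
                for (int polarity = -1; polarity <= 1; polarity += 2) {
                    double err = 0.0;
                    for (int i = 0; i < x.length; i++) {
                        int predicted = x[i][axis] < t ? -polarity : polarity;
                        if (predicted != y[i]) {
                            err += w[i];          // misclassified points cost their weight
                        }
                    }
                    if (best == null || err < best.error) {
                        best = new DecisionStump(axis, t, polarity, err);
                    }
                }
            }
        }
        return best;
    }

    public static void main(String[] args) {
        double[][] points = {{-3, 7}, {-1, -2}, {2, 5}, {4, -6}};
        int[] labels = {-1, -1, 1, 1};
        double[] weights = {0.25, 0.25, 0.25, 0.25};
        DecisionStump stump = train(points, labels, weights);
        System.out.println("axis=" + stump.axis + " threshold=" + stump.threshold
            + " error=" + stump.error);
    }
}
```

In a full AdaBoost loop, train would be called once per round with the re-weighted w, and the resulting stumps combined by their weighted vote.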
How about using JBoost? I think it's got what you're looking for.
Why don't you use a double[] array for each object? That is the common way of representing feature vectors in Java.

2D-Array : prefered way access items

So here I am tonight with a question that came to mind: what is your favourite way to access the items of an m x n matrix?
There is the normal way, where you use one index for the rows and another index for the columns: matrix[i][j].
And there is another way, where your matrix is a vector of length m*n and you access the items using [i*n+j] as the index.
Tell me which method you prefer most. Are there any other methods that would work for specific cases?
Let's say we have this piece of C(++) code:
int x = 3;
int y = 4;
arr2d[x][y] = 0xFF;
arr1d[x*10+y] = 0xFF;
Where:
unsigned char arr2d[10][10];
unsigned char arr1d[10*10];
Looking at the compiled version of it in a debugger (the assembly listing is not reproduced here), there is absolutely no penalty or slowdown when accessing array elements either way: for fixed-size arrays like these, both methods compile down to the same address computation.
I can think of only two reasons to use a one-dimensional array to represent n dimensions:
Performance: the usual way of allocating n-dimensional arrays means the dimensions are not necessarily allocated in one piece, which isn't great for spatial locality and may also cost additional memory accesses (in the worst case, one additional read per access). In C/C++ you can get around this (allocate the memory in one piece, then set up the correct pointers afterwards; just be really careful not to forget this when you delete it), and other languages (C#) can already do this out of the box. Also note that in a language with a stop-and-copy GC this reasoning is unnecessary, since all the objects will be allocated near each other anyhow. You still avoid the additional overhead for each dimension, though, so you use your memory and cache a bit better.
For some algorithms it's nicer to just use a one-dimensional array, which may make the code shorter and slightly faster. That's probably the one point that can be argued to be subjective here.
I think that if you need a 2D array, it's because you would like to access it as a 2D array, not as a 1D array.
Otherwise you can do a simple multiplication to treat it as a 1D array.
If I was to use a 2-D array, I would vote for matrix[i][j]. I think this is more readable. However, I might consider using Guava's Table class.
http://guava-libraries.googlecode.com/svn/trunk/javadoc/com/google/common/collect/Table.html
I don't think that your "favourite" way, or the most aesthetically pleasing way is a good approach to take with this issue - underlying performance would be my main concern.
Storing a matrix as a contiguous array is often the most efficient way of doing matrix calculations. If you take a look at optimised BLAS (Basic Linear Algebra Subprograms) libraries, such as the Intel MKL, the AMD ACML, ATLAS and so on, you will see that contiguous matrix storage is used. When contiguous storage is used and contiguous data access patterns are exploited, higher performance can result from the improved locality of reference (i.e. cache performance) of the operations.
In some languages (e.g. C++) you could use operator overloading to achieve the data[i][j] style of indexing while doing the 1D array index mapping behind the scenes.
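Java has no operator overloading, so a rough Java counterpart is a small wrapper with get and set methods over a single contiguous array. The class FlatMatrix and its method names are hypothetical; the mapping is the row-major i*n+j formula from the question.

```java
// Hypothetical wrapper: 2D-style access over one contiguous, row-major array.
class FlatMatrix {
    private final int rows;
    private final int cols;
    private final double[] data;

    FlatMatrix(int rows, int cols) {
        this.rows = rows;
        this.cols = cols;
        this.data = new double[rows * cols];
    }

    // Maps (i, j) to the flat position i * cols + j, rejecting out-of-range indices.
    // Without the check, a column index that is too large would silently land in the next row.
    int index(int i, int j) {
        if (i < 0 || i >= rows || j < 0 || j >= cols) {
            throw new IndexOutOfBoundsException("(" + i + ", " + j + ")");
        }
        return i * cols + j;
    }

    double get(int i, int j) {
        return data[index(i, j)];
    }

    void set(int i, int j, double value) {
        data[index(i, j)] = value;
    }

    public static void main(String[] args) {
        FlatMatrix m = new FlatMatrix(10, 10);
        m.set(3, 4, 255.0);
        System.out.println(m.index(3, 4));   // flat position of row 3, column 4
        System.out.println(m.get(3, 4));
    }
}
```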
Hope this helps.
