multi-dimensional sparse array in kotlin (jvm, js)? - java

I am searching for a library implementing a sparse multi-dimensional array for Kotlin on the JVM and JS. There is a SparseArray implementation in android.util, but can it be used on the JVM / JS?
Or is there something in the core library that I am just missing?

How about using a spatial index, such as a quadtree, R-tree, or kd-tree?
I am not aware of any Kotlin library, but if you can use Java, have a look at the TinSpin index library.
Especially if the positions are integers and/or unique, I can suggest the PH-Tree, which by default stores unique positions but can also be used as a multimap (simply insert a map/list/set into any occupied position).
Disclaimer: Self-advertisement - I am the developer of TinSpin as well as the PH-Tree.
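If a dedicated library turns out to be overkill, the multimap idea above can also be sketched with nothing but a hash map keyed by the coordinates. This is a minimal plain-Java illustration (the SparseGrid class is made up for this example), not a spatial index:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal sparse "n-dimensional array": only occupied cells consume memory.
class SparseGrid<T> {
    private final Map<List<Integer>, List<T>> cells = new HashMap<>();

    // Store a value at the given coordinates; an occupied cell acts as a multimap.
    void put(T value, int... coords) {
        cells.computeIfAbsent(toKey(coords), k -> new ArrayList<>()).add(value);
    }

    // Return the values stored at the coordinates, or an empty list.
    List<T> get(int... coords) {
        return cells.getOrDefault(toKey(coords), new ArrayList<>());
    }

    private List<Integer> toKey(int[] coords) {
        List<Integer> key = new ArrayList<>(coords.length);
        for (int c : coords) {
            key.add(c);
        }
        return key;
    }
}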

Related

spark - How to reduce the shuffle size of a JavaPairRDD<Integer, Integer[]>?

I have a JavaPairRDD<Integer, Integer[]> on which I want to perform a groupByKey action.
The groupByKey action gives me a:
org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle
which is practically an OutOfMemory error, if I am not mistaken. This occurs only in big datasets (in my case when "Shuffle Write" shown in the Web UI is ~96GB).
I have set:
spark.serializer org.apache.spark.serializer.KryoSerializer
in $SPARK_HOME/conf/spark-defaults.conf, but I am not sure if Kryo is used to serialize my JavaPairRDD.
Is there something else that I should do to use Kryo, apart from setting this conf parameter, to serialize my RDD? I can see in the serialization instructions that:
Spark automatically includes Kryo serializers for the many commonly-used core Scala classes covered in the AllScalaRegistrar from the Twitter chill library.
and that:
Since Spark 2.0.0, we internally use Kryo serializer when shuffling RDDs with simple types, arrays of simple types, or string type.
I also noticed that when I set spark.serializer to be Kryo, the Shuffle Write in the Web UI increases from ~96GB (with default serializer) to 243GB!
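For reference, this is roughly what I assume explicit Kryo registration would look like (a sketch, not my actual job setup):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

// Sketch: besides setting spark.serializer, register the classes that get
// shuffled so Kryo does not have to write full class names into the stream.
SparkConf conf = new SparkConf()
        .setAppName("shuffle-size-test")
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        .registerKryoClasses(new Class<?>[]{ Integer[].class, int[].class });
JavaSparkContext sc = new JavaSparkContext(conf);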
EDIT: In a comment, I was asked about the logic of my program, in case groupByKey can be replaced with reduceByKey. I don't think it's possible, but here it is anyway:
Input has the form:
key: index bucket id,
value: Integer array of entity ids in this bucket
The shuffle write operation produces pairs in the form:
key: entityId,
value: Integer array of all entity ids in the same bucket (call them neighbors)
The groupByKey operation gathers all the neighbor arrays of each entity, some possibly appearing more than once (in many buckets).
After the groupByKey operation, I keep a weight for each bucket (based on the number of negative entity ids it contains) and for each neighbor id I sum up the weights of the buckets it belongs to.
I normalize the scores of each neighbor id with another value (let's say it's given) and emit the top-3 neighbors per entity.
The number of distinct keys that I get is around 10 million (around 5 million positive entity ids and 5 million negatives).
EDIT2: I tried using Hadoop's Writables (VIntWritable and VIntArrayWritable extending ArrayWritable) instead of Integer and Integer[], respectively, but the shuffle size was still bigger than with the default JavaSerializer.
Then I increased spark.shuffle.memoryFraction from 0.2 to 0.4 (even though it is deprecated in version 2.1.0, and there is no description of what should be used instead) and enabled off-heap memory, and the shuffle size was reduced by ~20GB. Even if this does what the title asks, I would prefer a more algorithmic solution, or one that includes better compression.
Short Answer: Use fastutil and maybe increase spark.shuffle.memoryFraction.
More details:
The problem with this RDD is that Java needs to store object references, which consume much more space than primitive types. In this example, I need to store Integers instead of int values. A Java Integer takes 16 bytes, while a primitive Java int takes 4 bytes. Scala's Int type, on the other hand, is a 32-bit (4-byte) type, just like Java's int, which is why people using Scala may not have faced something similar.
Apart from increasing spark.shuffle.memoryFraction to 0.4, another nice solution was to use the fastutil library, as suggested in Spark's tuning documentation:
The first way to reduce memory consumption is to avoid the Java features that add overhead, such as pointer-based data structures and wrapper objects. There are several ways to do this: Design your data structures to prefer arrays of objects, and primitive types, instead of the standard Java or Scala collection classes (e.g. HashMap). The fastutil library provides convenient collection classes for primitive types that are compatible with the Java standard library.
This enables storing each element of the int arrays of my RDD pairs as an int (i.e., using 4 bytes instead of 16 for each element of the array). In my case, I used IntArrayList instead of Integer[]. This made the shuffle size drop significantly and allowed my program to run on the cluster. I also used this library in other parts of the code, where I was building some temporary Map structures. Overall, by increasing spark.shuffle.memoryFraction to 0.4 and using the fastutil library, the shuffle size dropped from 96GB to 50GB (!) using the default Java serializer (not Kryo).
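For reference, the conversion is roughly this (a sketch; the helper name is made up, the RDD types are the ones from the question):

import it.unimi.dsi.fastutil.ints.IntArrayList;
import org.apache.spark.api.java.JavaPairRDD;

// Sketch: turn the boxed Integer[] values into primitive-backed IntArrayLists
// before the shuffle, so each element costs 4 bytes instead of a boxed Integer.
static JavaPairRDD<Integer, IntArrayList> toPrimitiveValues(JavaPairRDD<Integer, Integer[]> input) {
    return input.mapValues(boxed -> {
        IntArrayList out = new IntArrayList(boxed.length);  // backed by a plain int[]
        for (Integer id : boxed) {
            out.add(id.intValue());
        }
        return out;
    });
}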
Alternative: I have also tried sorting each int array of an RDD pair and storing the deltas using Hadoop's VIntArrayWritable type (smaller numbers use less space than bigger numbers), but this also required registering VIntWritable and VIntArrayWritable in Kryo, which didn't save any space after all. In general, I think that Kryo only makes things faster, but does not decrease the space needed, though I am still not sure about that.
I am not marking this answer as accepted yet, because someone else might have a better idea, and because I didn't use Kryo after all, as my original question was asking. I hope reading it will help someone else with the same issue. I will update this answer if I manage to further reduce the shuffle size.
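The delta idea, in plain arrays, is roughly this (a sketch; the variable-length VInt encoding itself is left to the Writable):

import java.util.Arrays;

// Sort the ids, then keep only the gaps between consecutive values; with a
// variable-length encoding such as VInt, small gaps take fewer bytes.
static int[] toDeltas(int[] ids) {
    int[] sorted = ids.clone();
    Arrays.sort(sorted);
    int[] deltas = new int[sorted.length];
    int previous = 0;
    for (int i = 0; i < sorted.length; i++) {
        deltas[i] = sorted[i] - previous;
        previous = sorted[i];
    }
    return deltas;
}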
I'm still not really sure what you want to do. However, the fact that you use groupByKey and say that there is no way to do it with reduceByKey makes me even more confused.
I think you have rdd = (Integer, Integer[]) and you want something like (Integer, Iterable[Integer[]]), which is why you are using groupByKey.
Anyway, I am not really familiar with Java in Spark, but in Scala I would use reduceByKey to avoid the shuffle with
rdd.mapValues(Iterable(_)).reduceByKey(_ ++ _). Basically, you convert each value to a one-element collection of arrays and then combine the collections together.
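In Java the same idea would look roughly like this (a sketch, using the types from the question; the helper name is made up):

import java.util.ArrayList;
import java.util.List;
import org.apache.spark.api.java.JavaPairRDD;

// Wrap each value in a one-element list, then merge the lists with reduceByKey
// (which can combine map-side) instead of gathering everything with groupByKey.
static JavaPairRDD<Integer, List<Integer[]>> groupViaReduce(JavaPairRDD<Integer, Integer[]> rdd) {
    return rdd
            .mapValues(v -> {
                List<Integer[]> single = new ArrayList<Integer[]>();
                single.add(v);
                return single;
            })
            .reduceByKey((a, b) -> {
                a.addAll(b);
                return a;
            });
}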
I think the best approach that can be recommended here (without more specific knowledge of the input data) in general is to use the persist API on your input RDD.
As step one, I'd try to call .persist(MEMORY_ONLY_SER) on the input RDD to lower memory usage (albeit at a certain CPU overhead, which shouldn't be much of a problem for ints in your case).
If that is not sufficient, you can try out .persist(MEMORY_AND_DISK_SER); or, if your shuffle still takes so much memory that the input dataset needs to be made easier on the memory, .persist(DISK_ONLY) may be an option, but one that will strongly degrade performance.
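A minimal sketch of what that looks like in Java (the method is hypothetical; StorageLevel is Spark's own class):

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.storage.StorageLevel;

// Persist the input in serialized form before the shuffle-heavy stage.
static void persistInput(JavaPairRDD<Integer, Integer[]> input) {
    input.persist(StorageLevel.MEMORY_ONLY_SER());
    // Fallbacks mentioned above, trading performance for memory:
    // input.persist(StorageLevel.MEMORY_AND_DISK_SER());
    // input.persist(StorageLevel.DISK_ONLY());
}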

Porting from Python to Java: looking for a fast implementation of an array search operation

I am porting some Python code into Java and need to find an efficient implementation of the following Python operation:
sum([x in lstLong for x in lstShort])
where lstLong and lstShort are string lists. So here I am counting how many elements from lstShort appear in lstLong. In the Java implementation, both of these are String[]. What's the fastest way of counting this sum?
A more general question is whether there are any generic guidelines for writing "fast equivalents" in Java of Python list comprehension operations like the one above. I am mostly interested in arrays/lists of Strings, in case it makes the answer easier.
If you convert both to ArrayList<String>, you can use retainAll() followed by size(). Note that retainAll returns a boolean, not the list, so it has to be a separate call; retaining in the short list matches the Python semantics of counting how many elements of lstShort appear in lstLong:
lstShort.retainAll(lstLong);
int count = lstShort.size();
You can do the conversion with
new ArrayList<String>(Arrays.asList(var));
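If the lists are large, a HashSet-based count may also be worth a try (a sketch with the arrays from the question; the cost is roughly O(n + m) instead of O(n * m)):

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Build a set of the long array once, then count how many elements of the
// short array are contained in it (duplicates in lstShort are counted, as in
// the Python version).
static int countContained(String[] lstShort, String[] lstLong) {
    Set<String> longSet = new HashSet<String>(Arrays.asList(lstLong));
    int count = 0;
    for (String s : lstShort) {
        if (longSet.contains(s)) {
            count++;
        }
    }
    return count;
}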

Java library that has matlab equivalent functions

I am looking for a Java library that closely mirrors MATLAB's matrix functions and possibly other functions in areas such as polynomial interpolation.
If such a library does not exist, I was toying with the idea of building my own, but using an existing matrix or scientific computing library to do the heavy lifting. If I were to do that, which libraries would be candidates to serve as backends for such an effort?
Eigen, one of the most used (and fastest) libraries for matrix computation in C++, has a Java wrapper: jeigen.
It allows one to manipulate dense and sparse matrices and perform operations on them. It can also be worth trying.
Check out the following resources/packages
http://math.nist.gov/javanumerics/jama/
http://www.jscience.org/
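To give a feel for the JAMA style (a minimal sketch, assuming the standard Jama.Matrix API):

import Jama.Matrix;

// Build two matrices, multiply them and solve a linear system - the
// operations that map most directly to MATLAB's A*B and A\b.
public class JamaExample {
    public static void main(String[] args) {
        double[][] a = { { 4.0, 3.0 }, { 6.0, 3.0 } };
        double[][] b = { { 1.0 }, { 2.0 } };

        Matrix A = new Matrix(a);
        Matrix B = new Matrix(b);

        Matrix product = A.times(B);   // like A*B in MATLAB
        Matrix x = A.solve(B);         // like A\b in MATLAB

        product.print(6, 3);           // column width 6, 3 decimal digits
        x.print(6, 3);
    }
}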
Take a look at la4j (Linear Algebra for Java). It supports dense matrices as well as sparse ones. Here is just a brief example of using the functional features of la4j:
// reads the dense matrix from the CSV file
Matrix a = new Basic2DMatrix(Matrices.asSymbolSeparatedSource("matrix.csv", ","));
// calculates the sum of all elements of the matrix 'a'
double sum = a.fold(Matrices.asSumAccumulator(0));
// creates a new matrix 'b', that contains elements of matrix 'a' multiplied by '2'.
Matrix b = a.transform(Matrices.asMulFunction(2));
The best way to get the latest version of la4j is to visit its GitHub page.
I use the Colt library for matrix operations.
See: http://acs.lbl.gov/software/colt/api/index.html
I think it's really good and easy to use, and better than Apache Commons Math and EJML, which I have already tried.
I suggest you try all of the libraries mentioned and choose the one that is closer to your needs.

2D array: preferred way to access items

So here I am tonight with this question that came up in my mind:
What is your favourite way to access the items of an m x n matrix?
There is the normal way, where you use one index for the rows and another index for the columns: matrix[i][j]. And there is another way, where your matrix is a vector of length m*n and you access the items using [i*n+j] as the index.
Tell me which method you prefer most. Are there any other methods that would work for specific cases?
Let's say we have this piece of C(++) code:
int x = 3;
int y = 4;
arr2d[x][y] = 0xFF;
arr1d[x*10+y] = 0xFF;
Where:
unsigned char arr2d[10][10];
unsigned char arr1d[10*10];
And now let's look at the compiled version of it (the assembly, inspected in a debugger): both statements compile to the same instructions.
As you can see, there's absolutely no penalty or slowdown when accessing array elements this way, no matter whether you use a 2D array or the flattened one, since both methods are actually the same.
There are only two reasons I can think of to go for a one-dimensional array to represent n dimensions:
Performance: The usual way to allocate n-dimensional arrays means that we get n dimensions that may not necessarily be allocated in one piece - which isn't great for spatial locality (and may also result in at least some additional memory accesses - in the worst case we need one additional read for each access). In C/C++ you can get around this (allocate the memory in one piece, then set up the correct pointers afterwards; just be really careful not to forget this when you delete it), and other languages (C#) can already do this out of the box. Also note that in a language with a stop&copy GC this reasoning is unnecessary, since all the objects will be allocated near each other anyhow. You do avoid the additional overhead for each single dimension though, so you use your memory and cache a bit better.
For some algorithms it's nicer to just use a one-dimensional array, which may make the code shorter and slightly faster - that's probably the one thing that can be argued as subjective here.
I think that if you need a 2D array, it is because you would like to access it as a 2D array, not as a 1D array.
Otherwise you can do a simple multiplication to flatten it into a 1D array.
If I were to use a 2D array, I would vote for matrix[i][j]. I think this is more readable. However, I might also consider using Guava's Table class.
http://guava-libraries.googlecode.com/svn/trunk/javadoc/com/google/common/collect/Table.html
I don't think that your "favourite" way, or the most aesthetically pleasing way, is a good approach to take with this issue - underlying performance would be my main concern.
Storing a matrix as a contiguous array is often the most efficient way of doing matrix calculations. If you take a look at optimised BLAS (Basic Linear Algebra Subprograms) libraries, such as the Intel MKL, the AMD ACML, ATLAS, etc., contiguous matrix storage is what they use. When contiguous storage is used and contiguous data access patterns are exploited, higher performance can result due to the improved locality of reference (i.e. cache performance) of the operations.
In some languages (e.g. C++) you could use operator overloading to achieve the data[i][j] style of indexing while doing the 1D array index mapping behind the scenes.
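In Java there is no operator overloading, but the same mapping can simply be hidden behind get/set methods (a small sketch; the FlatMatrix class is made up for illustration):

// One contiguous row-major array with the i*cols + j mapping kept in one place.
class FlatMatrix {
    private final int cols;
    private final double[] data;

    FlatMatrix(int rows, int cols) {
        this.cols = cols;
        this.data = new double[rows * cols];
    }

    double get(int i, int j) {
        return data[i * cols + j];
    }

    void set(int i, int j, double value) {
        data[i * cols + j] = value;
    }
}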
Hope this helps.

Compressed SortedSet<Long> implementation

I need to store a large number of Long values in a SortedSet implementation in a space-efficient manner. I was considering bit-set implementations and discovered Javaewah. However, the API expects int values rather than longs.
Can anyone recommend any alternatives or suggest a good way to solve this problem? I am mainly concerned with space efficiency. Upon building the set I will need to access the minimum and maximum element once. However, access time is not a huge concern (i.e. so a fully run-length encoded implementation will be fine).
EDIT
I should be clear that the implementation does not have to implement the SortedSet interface, provided I can access the minimum and maximum elements of the collection.
You could use TLongArrayList, which uses a long[] underneath. It supports sort(), so the min and max will be the first and last values.
Or you could use a long[] with a length and do this yourself. ;)
This will use about 64 bytes more than the raw values themselves. You can get more compact if you can make some assumptions about the range of the long values, e.g. if they are actually limited to 48 bits.
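Roughly, with TLongArrayList (a sketch; the import path assumes Trove 3):

import gnu.trove.list.array.TLongArrayList;

// Primitive-backed list: add the values, sort once, then read min and max
// off the two ends of the underlying long[].
TLongArrayList values = new TLongArrayList();
values.add(42L);
values.add(7L);
values.add(1000000000000L);

values.sort();
long min = values.get(0);
long max = values.get(values.size() - 1);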
You might consider using a LongBuffer. If it is memory-mapped it avoids using heap or direct memory, but you would have to implement a sort routine yourself.
If they are clustered, you might be able to represent the data as a Set of ranges. The ranges could be a pure A - B, or a BitSet with a starting value. The latter works well for phone numbers. ;)
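A rough sketch of the BitSet-with-a-starting-value variant (the LongCluster class is made up; it assumes each cluster spans no more than Integer.MAX_VALUE values above its base):

import java.util.BitSet;

// One cluster: a base value plus a BitSet of offsets from that base, so a
// dense run of longs costs roughly one bit per possible offset.
class LongCluster {
    private final long base;
    private final BitSet offsets = new BitSet();

    LongCluster(long base) {
        this.base = base;
    }

    void add(long value) {
        offsets.set((int) (value - base));   // assumes base <= value <= base + Integer.MAX_VALUE
    }

    boolean contains(long value) {
        long delta = value - base;
        return delta >= 0 && delta <= Integer.MAX_VALUE && offsets.get((int) delta);
    }

    long min() {
        return base + offsets.nextSetBit(0);
    }

    long max() {
        return base + offsets.length() - 1;  // length() is one past the highest set bit
    }
}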
Not sure if it has Set or how efficient it is compared to regular JCF, but take a look at this:
http://commons.apache.org/primitives/
