Spark RDD, how to generate JavaRDD of length N? - java

(Part of the problem is that the docs say "undocumented" for parallelize, which leaves me reading books for examples that don't always apply.)
I am trying to create an RDD length N = 10^6 by executing N operations of a Java class we have, I can have that class implement Serializable or any Function if necessary. I don't have a fixed length dataset up front, I am trying to create one. Trying to figure out whether to create a dummy array of length N to parallelize, or pass it a function that runs N times.
Not sure which approach is valid/better, I see in Spark if I am starting out with a well defined data set like words in a doc, the length/count of those words is already defined and I just parallelize some map or filter to do some operation on that data.
In my case I think it's different: I'm trying to parallelize the creation of an RDD that will contain 10^6 elements...
DESCRIPTION:
In Java 8 using Spark 1.5.1, we have a Java method doDrop() that takes a PipeLinkageData and returns a DropResult.
I am thinking I could use map() or flatMap() to call a one to many function, I was trying to do something like this in another question that never quite worked:
JavaRDD<DropResult> simCountRDD = spark.parallelize(makeRange(1,getSimCount())).map(new Function<Integer, DropResult>()
{
public DropResult call(Integer i) {
return pld.doDrop();
}
});
Thinking something like this is more the correct approach?
// pld is of type PipeLinkageData, it's already initialized
// parallelize wants a collection passed into first param
List<PipeLinkageData> pldListofOne = new ArrayList<>();
// make an ArrayList of one
pldListofOne.add(pld);
int howMany = 1000000;
JavaRDD<DropResult> nSizedRDD = spark.parallelize(pldListofOne).flatMap(new FlatMapFunction<PipeLinkageData, DropResult>()
{
public Iterable<DropResult> call(PipeLinkageData pld) {
List<DropResult> returnList = new ArrayList<>();
// is Spark good at spreading a for loop like this?
for ( int i = 0; i < howMany ; i++ ){
returnList.add(pld.doDrop());
}
// EDIT changed from returnRDD to returnList
return returnList;
}
});
One other concern: is a JavaRDD correct here? I can see needing to call FlatMapFunction, but I don't need a FlatMappedRDD? And since I am never trying to flatten a group of arrays or lists into a single array or list, do I really ever need to flatten anything?

The first approach should work as long as DropResult and PipeLinkageData can be serialized and there are no issues with its internal logic (such as depending on shared state).
The second approach, in its current form, doesn't make sense. A single record will be processed on a single partition, which means the whole process will be completely sequential and can crash if the data doesn't fit in a single worker's memory. Increasing the number of elements should solve the problem, but it doesn't improve on the first approach.
Finally, you can initialize an empty RDD and then use mapPartitions, replacing FlatMapFunction with the almost identical MapPartitionsFunction, and generate the required number of objects per partition.

Related

Java : Creating chunks of List for processing

I have a list with a large number of elements. While processing this list, in some cases I want the list to be partitioned into smaller sub-lists and in some cases I want to process the entire list.
private void processList(List<X> entireList, int partitionSize)
{
Iterator<X> entireListIterator = entireList.iterator();
Iterator<List<X>> chunkOfEntireList = Iterators.partition(entireListIterator, partitionSize);
while (chunkOfEntireList.hasNext()) {
doSomething(chunkOfEntireList.next());
if (chunkOfEntireList.hasNext()) {
doSomethingOnlyIfTheresMore();
}
}
}
I'm using com.google.common.collect.Iterators to create the partitions.
So in cases where I want to partition the list with size 100, I call
processList(entireList, 100);
Now, when I don't want to create chunks of the list, I thought I could pass Integer.MAX_VALUE as partitionSize.
processList(entireList, Integer.MAX_VALUE);
But this leads to my code going out of memory. Can someone help me out? What am I missing? What is Iterators doing internally and how do I overcome this?
EDIT: I also need the "if" clause inside, to do something only if there are more lists to process, i.e. I need the iterator's hasNext() method.
You're getting an out of memory error because Iterators.partition() internally populates an array with the given partition length. The allocated array is always the partition size because the actual number of elements is not known until the iteration is complete. (The issue could have been prevented if they had used an ArrayList internally; I guess the designers decided that arrays would offer better performance in the common case.)
Using Lists.partition() will avoid the problem since it delegates to List.subList(), which is only a view of the underlying list:
private void processList(List<X> entireList, int partitionSize) {
for (List<X> chunk : Lists.partition(entireList, partitionSize)) {
doSomething(chunk);
}
}
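The view-based idea can also be sketched with only java.util while keeping the "is there another chunk?" check from the question. This is a minimal sketch under assumptions, not Guava's implementation: the class name ChunkSketch and the helper chunkViews are made up for illustration, and the println calls stand in for doSomething() and doSomethingOnlyIfTheresMore().

```java
import java.util.ArrayList;
import java.util.List;

class ChunkSketch {
    // Hypothetical helper: returns subList views of at most chunkSize elements.
    // No element is copied and nothing of size chunkSize is allocated up front,
    // so passing Integer.MAX_VALUE simply yields one view of the whole list.
    static <T> List<List<T>> chunkViews(List<T> source, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        List<List<T>> chunks = new ArrayList<>();
        int start = 0;
        while (start < source.size()) {
            // Compare against the remaining size so start + chunkSize cannot overflow.
            int remaining = source.size() - start;
            int end = (remaining <= chunkSize) ? source.size() : start + chunkSize;
            chunks.add(source.subList(start, end));
            start = end;
        }
        return chunks;
    }

    public static void main(String[] args) {
        List<Integer> data = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            data.add(i);
        }
        List<List<Integer>> chunks = chunkViews(data, 4);
        for (int i = 0; i < chunks.size(); i++) {
            System.out.println("processing " + chunks.get(i));
            // Equivalent of the hasNext() check in the question.
            if (i < chunks.size() - 1) {
                System.out.println("more chunks follow");
            }
        }
    }
}
```

Because the chunks are views, changes to the underlying list while they are in use would invalidate them, so this only fits the case where the list is not modified during processing.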
Normally, partitioning allocates a new list of the given partitionSize, so this error is to be expected here. Why not use the original list when you need only a single partition? Possible solutions:
Create a separate overloaded method that doesn't take the size.
Pass the size as -1 when you don't need any partitioning. In the method, check the value; if it is -1, put the original list into chunkOfEntireList.

Accessing Array.items[] in a for-loop error

With the aim of getting a better performance I'm fine tuning the code, looking through the DDMS tracer. One aspect is Array.get(x) which is more expensive than Array.items[x]
We can directly access the items, provided the array type is Object or we specify the array type in the constructor, like so:
Array<MyClass> foo = new Array<MyClass>(MyClass.class);
This works fine, however, how do I specify the last MyClass.class in a for loop? I have this at the moment:
for (Array<MyClass> listOfObjects : allObjects) {
for (int i=0; i<listOfObjects.size; i++) {
MyClass myObj = listOfObjects.get(i);
//MyClass myObj = listOfObjects.items[i];
}
}
The commented line works fine, but to get rid of the overhead I want to supply the (MyClass.class) as mentioned above. Where can I do this in that for loop?
Many thanks
J
I think that what you're trying to do is pointless. Please read this great article: http://blog.codinghorror.com/the-sad-tragedy-of-micro-optimization-theater/
You are trying to achieve a minimal optimization while greatly reducing readability and maintainability.
If you want less overhead, it would probably be wiser to look at a language like C++, rather than trying to hack basic java for loops.
Another thing you may want to look into is Java 8, which has added functionality for executing loops concurrently with Streams.
Array<MyClass> foo = new Array<MyClass>(MyClass.class);
Note that you are creating a NEW array with this line, passing it a class argument. From http://libgdx.badlogicgames.com/nightlies/docs/api/com/badlogic/gdx/utils/Array.html
Array(java.lang.Class arrayType)
Creates an ordered array with items of the specified type and a capacity of 16.
I don't see you trying to create new Arrays in the other code you posted. Are you trying to populate each listOfObjects in allObjects?
If so, you would want to do something like:
for (int i = 0; i < allObjects.size; i++)
{
allObjects.items[i] = new Array<MyClass>(MyClass.class);
}
If you are simply trying to loop through these arrays, there is no class argument needed. I would suggest comparing the Array class to other Gdx or Java collections if the speed of iteration is too slow.
This quote from above link may also be notable if you do a lot of removing from the arrays.
A resizable, ordered or unordered array of objects. If unordered, this class avoids a memory copy when removing elements (the last element is moved to the removed element's position).

What's more efficient and compact: A huge set of linkedlist variables or a two-dimensional arraylist containing each of these?

I want to create a large matrix (n by n) where each element corresponds to a LinkedList (of certain objects).
I can either
Create the n*n individual linked lists and name them with the help of eval() within a loop that iterates through both dimensions (or something similar), so that in the end I'll have LinkedList_1_1, LinkedList_1_2 etc. Each one has a unique variable name. Basically, skipping the matrix altogether.
Create an ArrayList of ArrayLists and then push into each element a linked list.
Please recommend a method that conserves time and space and gives ease of access in my later code, when I want to reference individual LinkedLists. Ease of access will be poor with Method 1, as I'll have to use eval whenever I want to access a particular linked list.
My gut-feeling tells me Method 2 is the best approach, but how exactly do I form my initializations?
As you know the sizes to start with, why don't you just use an array? Unfortunately Java generics prevent the array element type from being a concrete generic type, but you can use a wildcard:
LinkedList<?>[][] lists = new LinkedList<?>[n][n];
Or slightly more efficient in memory, just a single array:
LinkedList<?>[] lists = new LinkedList<?>[n * n];
// Then for access...
lists[y * n + x] = ...;
Then you'd need to cast on each access, using @SuppressWarnings given that you know it will always work (assuming you encapsulate it appropriately). I'd put that in a single place:
@SuppressWarnings("unchecked")
private LinkedList<Foo> getList(int x, int y) {
if (lists[y][x] == null) {
lists[y][x] = new LinkedList<Foo>();
}
// Cast won't actually have any effect at execution time. It's
// just to tell the compiler we know what we're doing.
return (LinkedList<Foo>) lists[y][x];
}
Of course in both cases you'd then need to populate the arrays with empty linked lists if you needed to. (If several of the linked lists never end up having any nodes, you may wish to consider only populating them lazily.)
I would certainly not generate a class with hundreds of variables. It would make programmatic access to the lists very painful, and basically be a bad idea in any number of ways.
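Putting the single-array layout and the lazy population together might look like the sketch below. It is only an illustration under assumptions: the class name ListGrid and its bounds check are not from the answer above, and the element type is made a type parameter instead of the Foo used there.

```java
import java.util.LinkedList;

class ListGrid<T> {
    private final int n;
    // One flat array of n * n slots; a slot stays null until it is first requested.
    private final LinkedList<?>[] cells;

    ListGrid(int n) {
        this.n = n;
        this.cells = new LinkedList<?>[n * n];
    }

    // The unchecked cast is safe because this method is the only place
    // that ever stores a list in cells, and it always stores a LinkedList<T>.
    @SuppressWarnings("unchecked")
    LinkedList<T> get(int x, int y) {
        if (x < 0 || x >= n || y < 0 || y >= n) {
            throw new IndexOutOfBoundsException("(" + x + ", " + y + ")");
        }
        int index = y * n + x;
        if (cells[index] == null) {
            cells[index] = new LinkedList<T>();
        }
        return (LinkedList<T>) cells[index];
    }

    public static void main(String[] args) {
        ListGrid<String> grid = new ListGrid<>(3);
        grid.get(1, 2).add("hello");
        System.out.println(grid.get(1, 2));           // prints [hello]
        System.out.println(grid.get(0, 0).isEmpty()); // prints true
    }
}
```

Keeping the array private means the cast and the index arithmetic live in one method, so the rest of the code only ever sees a typed LinkedList.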

Converting to a column oriented array in Java

Although I have Java in the title, this could be for any OO language.
I'd like to know a few new ideas to improve the performance of something I'm trying to do.
I have a method that is constantly receiving an Object[] array. I need to split the Objects in this array through multiple arrays (List or something), so that I have an independent list for each column of all arrays the method receives.
Example:
List<List<Object>> columnOriented = new ArrayList<List<Object>>();
public void newObject(Object[] obj) {
for (int i = 0; i < obj.length; i++) {
columnOriented.get(i).add(obj[i]);
}
}
Note: For simplicity I've omitted the initialization of objects and stuff.
The code I've shown above is slow of course. I've already tried a few other things, but would like to hear some new ideas.
How would you do this knowing it's very performance sensitive?
EDIT:
I've tested a few things and found that:
Instead of using ArrayList (or any other Collection), I wrapped an Object[] array in another object to store individual columns. If this array reaches its capacity, I create another array with twice the size and copy the contents from one to the other using System.arraycopy. Surprisingly (at least for me), this is faster than using ArrayList to store the inner columns...
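For reference, the wrapper described in this edit could look roughly like the following. This is a sketch of the idea only, not the code that was actually measured: the class name GrowableColumn and the initial capacity of 16 are assumptions.

```java
class GrowableColumn {
    // Assumed starting capacity; the original post does not state one.
    private Object[] items = new Object[16];
    private int size;

    void add(Object value) {
        if (size == items.length) {
            // Capacity reached: allocate an array twice as large and copy everything over.
            Object[] bigger = new Object[items.length * 2];
            System.arraycopy(items, 0, bigger, 0, size);
            items = bigger;
        }
        items[size++] = value;
    }

    Object get(int index) {
        if (index < 0 || index >= size) {
            throw new IndexOutOfBoundsException("index " + index + ", size " + size);
        }
        return items[index];
    }

    int size() {
        return size;
    }

    public static void main(String[] args) {
        GrowableColumn column = new GrowableColumn();
        for (int i = 0; i < 100; i++) {
            column.add(i);
        }
        System.out.println(column.size());  // prints 100
        System.out.println(column.get(99)); // prints 99
    }
}
```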
The answer depends on the data and usage profile. How much data do you have in such collections? What is the proportion of reads to writes (adding object arrays)? This affects which structure is better for the inner list, and many other possible optimizations.
The fastest way to copy data is to avoid copying at all. If you know that the obj array is not modified further by the calling code (this is an important condition), one possible trick is to implement your own custom List class to use as the inner list. Internally you store a shared List<Object[]>, and each call just adds the new array to that list. The custom inner list class knows which column it represents (call it n), and when it is asked for the item at position m, it transposes m and n and queries the internal structure to get internalArray.get(m)[n]. This implementation is unsafe because it places a restriction on the caller that is easy to forget, but it might be faster under some conditions (and slower under others).
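One way to sketch this custom inner list is a read-only view built on java.util.AbstractList. The class name ColumnView is hypothetical, and the sketch assumes every stored row has at least column + 1 entries and is never modified after being added.

```java
import java.util.AbstractList;
import java.util.ArrayList;
import java.util.List;

class ColumnView extends AbstractList<Object> {
    private final List<Object[]> rows; // shared by all column views
    private final int column;          // the column this view represents

    ColumnView(List<Object[]> rows, int column) {
        this.rows = rows;
        this.column = column;
    }

    // Item m of column n is entry n of row m: nothing is copied.
    @Override
    public Object get(int rowIndex) {
        return rows.get(rowIndex)[column];
    }

    @Override
    public int size() {
        return rows.size();
    }

    public static void main(String[] args) {
        List<Object[]> rows = new ArrayList<>();
        List<Object> firstColumn = new ColumnView(rows, 0);
        List<Object> secondColumn = new ColumnView(rows, 1);

        // Receiving a new row is a single add of the array reference.
        rows.add(new Object[] {"a", 1});
        rows.add(new Object[] {"b", 2});

        System.out.println(firstColumn);  // prints [a, b]
        System.out.println(secondColumn); // prints [1, 2]
    }
}
```

The views are read-only (AbstractList throws UnsupportedOperationException on add and set), which matches the restriction that only the receiving method appends data.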
I would try using LinkedList for the inner list, because it should have better performance for insertions. Wrapping the Object array in a collection and using addAll might help as well.
ArrayList may be slow due to the copying of arrays (it uses an approach similar to your self-written collection).
As an alternative, you could simply store the rows at first and create the columns when necessary. This way, copying of the lists' internal arrays is reduced to a minimum.
Example:
//Notice: You can use a LinkedList for rows, as no index based access is used.
List<Object[]> rows =...
List<List<Object>> columns;
public void processColumns() {
columns = new ArrayList<List<Object>>();
for(Object[] aRow : rows){
while (aRow.length > columns.size()){
//This ensures that the ArrayList is big enough, so no copying is necessary
List<Object> newColumn = new ArrayList<Object>(rows.size());
columns.add(newColumn);
}
for (int i = 0; i < aRow.length; i++){
columns.get(i).add(aRow[i]);
}
}
}
Depending on the number of columns, it's still possible that the outer list copies arrays internally, but normal tables contain far more rows than columns, so it should only be a small array.
Use a LinkedList for implementing the column lists. It grows linearly with the data, and appending is O(1). (If you use ArrayList, it has to resize the internal array from time to time.)
After collecting the values you can convert the linked lists to arrays. If N is the number of rows, you go from holding 3*N refs for each list (each LinkedList node has prevRef/nextRef/itemRef) to only N refs.
It would be nice to have an array for holding the different column lists, but of course, it's not a big improvement and you can do it only if you know the column count in advance.
Hope it helps!
Edit: tests and theory indicate that ArrayList is better in amortized cost, that is, the total cost divided by the number of items processed... so don't follow my 'advice' :)

How can I create a java list using the member variables of an existing list, without using a for loop?

I have a java list
List<myClass> myList = myClass.selectFromDB("where clause");
//myClass.selectFromDB returns a list of objects from DB
But I want a different list, specifically.
List<Integer> goodList = new ArrayList<Integer>();
for (int i = 0; i < myList.size(); i++) {
goodList.add(myList.get(i).getGoodInteger());
}
Yes, I could do a different query from the DB in the initial myList creation, but assume for now I must use that as the starting point and no other DB queries. Can I replace the for loop with something much more efficient?
Thank you very much for any input, apologies for my ignorance.
In order to extract a field from the "myclass", you're going to have to loop through the entire contents of the list. Whether you do that with a for loop, or use some sort of construct that hides the for loop from you, it's still going to take approximately the same time and use the same amount of resources.
An important question is: why do you want to do this? Are you trying to make your code cleaner? If so, you could write a method along these lines:
public static List<Integer> extractGoodInts (List<myClass> myList) {
List<Integer> goodInts = new ArrayList<Integer>();
for(int i = 0; i < myList.size(); i++){
goodInts.add(myList.get(i).getGoodInteger());
}
return goodInts;
}
Then, in your code, you can just go:
List<myClass> myList = myClass.selectFromDB("where clause");
List<Integer> goodInts = myClass.extractGoodInts(myList);
However, if you're trying to make your code more efficient and you're not allowed to change the query, you're out of luck; somehow or another, you're going to need to individually grab each int from the list, which means you're going to be running in O(n) time no matter what clever tricks you can come up with.
There are really only two ways I can think of that you can make this more "efficient":
Somehow split this up between multiple cores so you can do the work in parallel. Of course, this assumes that you've got other cores, they aren't doing anything useful already, and that there's enough processing going on that the overhead of doing this is even worth it. My guess is that (at least) the last point isn't true in your case given that you're just calling a getter. If you wanted to do this you'd try to have a number of threads (I'd probably actually use an Executor and Futures for this) equal to the number of cores, and then give roughly equal amounts of work to each of them (probably just by slicing your list into roughly equal sized pieces).
If you believe that you'll only be accessing a small subset of the resulting List, but are unsure of exactly which elements, you could try doing things lazily. The easiest way to do that would be to use a pre-built lazy mapping List implementation. There's one in Google Collections Library. You use it by calling Lists.transform(). It'll immediately return a List, but it'll only perform your transformation on elements as they are requested. Again, this is only more efficient if it turns out that you only ever look at a small fraction of the output List. If you end up looking at the entire thing this will not be more efficient, and will probably work out to be less efficient.
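To make the lazy option concrete without pulling in a library, here is a small stand-in for what a lazily mapped list does. It is a sketch under assumptions, not the Lists.transform() implementation: the class name LazyMappedList is made up, and it uses java.util.function.Function, so it assumes Java 8 or later.

```java
import java.util.AbstractList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

class LazyMappedList<F, T> extends AbstractList<T> {
    private final List<F> source;
    private final Function<? super F, ? extends T> mapper;

    LazyMappedList(List<F> source, Function<? super F, ? extends T> mapper) {
        this.source = source;
        this.mapper = mapper;
    }

    // The mapping runs only when an element is requested, and runs again
    // on every request because no result is cached.
    @Override
    public T get(int index) {
        return mapper.apply(source.get(index));
    }

    @Override
    public int size() {
        return source.size();
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("a", "bb", "ccc");
        List<Integer> lengths = new LazyMappedList<String, Integer>(words, String::length);
        // Only the third element is mapped here.
        System.out.println(lengths.get(2)); // prints 3
    }
}
```

Constructing the view is O(1), but reading every element costs at least as much as the plain loop, which is why this only pays off when a small fraction of the result is ever read.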
Not sure what you mean by efficient. As the others said, you have to call the getGoodInteger method on every element of that list one way or another. About the best you can do is avoid checking the size every time:
List<Integer> goodInts = new ArrayList<Integer>();
for (myClass myObj : myList) {
goodInts.add(myObj.getGoodInteger());
}
I also second jboxer's suggestion of making a function for this purpose.
