Java : Creating chunks of List for processing

I have a list with a large number of elements. While processing this list, in some cases I want the list to be partitioned into smaller sub-lists and in some cases I want to process the entire list.
private void processList(List<X> entireList, int partitionSize)
{
    Iterator<X> entireListIterator = entireList.iterator();
    Iterator<List<X>> chunkOfEntireList = Iterators.partition(entireListIterator, partitionSize);
    while (chunkOfEntireList.hasNext()) {
        doSomething(chunkOfEntireList.next());
        if (chunkOfEntireList.hasNext()) {
            doSomethingOnlyIfTheresMore();
        }
    }
}
I'm using com.google.common.collect.Iterators (from Guava) to create the partitions.
So in cases where I want to partition the list with size 100, I call
processList(entireList, 100);
Now, when I don't want to create chunks of the list, I thought I could pass Integer.MAX_VALUE as partitionSize.
processList(entireList, Integer.MAX_VALUE);
But this makes my code run out of memory. Can someone help me out? What am I missing? What is Iterators doing internally, and how do I overcome this?
EDIT: I also need the "if" clause inside the loop, to do something only if there are more lists to process. That is, I need the iterator's hasNext() method.

You're getting an out of memory error because Iterators.partition() internally populates an array with the given partition length. The allocated array is always the partition size because the actual number of elements is not known until the iteration is complete. (The issue could have been prevented if they had used an ArrayList internally; I guess the designers decided that arrays would offer better performance in the common case.)
Using Lists.partition() will avoid the problem since it delegates to List.subList(), which is only a view of the underlying list:
private void processList(List<X> entireList, int partitionSize) {
for (List<X> chunk : Lists.partition(entireList, partitionSize)) {
doSomething(chunk);
}
}
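To see why a view-based partition stays cheap even with a huge partition size, here is a minimal sketch of the same idea using only the standard library. The names SubListChunks, chunkViews and countGaps are hypothetical, and this is not Guava's implementation; it only illustrates chunking through List.subList(). Because the chunks come back as a list, its iterator still offers hasNext() for the "only if there's more" step from the question.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

final class SubListChunks {
    // Hypothetical helper, not Guava's code: splits a list into consecutive
    // subList views. No elements are copied, and no array of length
    // partitionSize is ever allocated.
    static <T> List<List<T>> chunkViews(List<T> list, int partitionSize) {
        if (partitionSize <= 0) {
            throw new IllegalArgumentException("partitionSize must be positive");
        }
        List<List<T>> chunks = new ArrayList<>();
        int start = 0;
        while (start < list.size()) {
            // Compare against the remaining length so that
            // start + partitionSize can never overflow an int.
            int length = Math.min(partitionSize, list.size() - start);
            chunks.add(list.subList(start, start + length));
            start += length;
        }
        return chunks;
    }

    // Counts how often another chunk follows the current one, mirroring the
    // hasNext() check in the question's loop.
    static <T> int countGaps(List<T> list, int partitionSize) {
        int gaps = 0;
        Iterator<List<T>> it = chunkViews(list, partitionSize).iterator();
        while (it.hasNext()) {
            it.next();          // doSomething(chunk) would go here
            if (it.hasNext()) {
                gaps++;         // doSomethingOnlyIfTheresMore() would go here
            }
        }
        return gaps;
    }
}
```

With a partition size of Integer.MAX_VALUE this sketch produces a single chunk that is just a view of the whole list.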

Normally, partitioning allocates a new list with the given partitionSize, so with Integer.MAX_VALUE this error is to be expected. Why not use the original list when you only need a single partition? Possible solutions:
Create a separate overloaded method that doesn't take the size.
Pass the size as -1 when you don't need any partitioning. In the method, check the value; if it is -1, use the original list as the only chunk in chunkOfEntireList.
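A related option is to clamp the requested size before it ever reaches the partitioning call, so the allocation can never exceed the list itself. The sketch below is only an illustration; PartitionSizes and effectiveSize are hypothetical names, and treating any non-positive size as "no partitioning" is an assumption that extends the -1 convention suggested above.

```java
final class PartitionSizes {
    // Hypothetical helper: returns a partition size that is at least 1 and
    // never larger than the list. A non-positive request (such as -1) is
    // treated as "process the whole list as one chunk".
    static int effectiveSize(int listSize, int requestedSize) {
        if (requestedSize <= 0) {
            return Math.max(1, listSize);
        }
        return Math.max(1, Math.min(listSize, requestedSize));
    }
}
```

The question's processList could then pass PartitionSizes.effectiveSize(entireList.size(), partitionSize) to Iterators.partition and keep its hasNext() logic unchanged.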

Related

Are there better ways to make an ArrayList bigger?

I am trying to use an ArrayList to store and retrieve items by an index value. My code is similar to this:
ArrayList<Object> items = new ArrayList<>();
public void store (int index, Object item)
{
    // The size must be at least index + 1 before set(index, ...) is valid.
    while (items.size() <= index) items.add(null);
    items.set(index, item);
}
The loop seems ugly and I would like to use items.setMinimumSize(index + 1) but it does not exist. I could use items.addAll(Arrays.asList(new Object[index - items.size()])) but that allocates memory and seems overly complex as a way to just make an array bigger. Most of the time the loop will iterate zero times, so I prefer simplicity over speed.
The index values probably won't exceed 200, so I could use:
Object[] items = new Object[200];
public void store (int index, Object item)
{
items[index] = item;
}
but this would break if it ever needs over 200 values.
Are these really my only options? It seems that I am forced into something more complex than it should be.

I would consider using a Map instead of a List construct. Why not this :
//depending on your requirements, you might want to use another Map
//implementation, just read through the docs...
Map<Integer, Object> items = new HashMap<>();
public void store (int index, Object item)
{
items.put(index, item);
}
This way you can avoid that loop entirely.

The problem is, what you want isn't really an arraylist. It feels like what you want is a Map<Integer, T> instead of an ArrayList<T>, as you clearly want to map an index to a value; you want this operation:
Add a value to this list such that list.get(250) returns it.
And that is possible with arraylist, but not really what it is for, and when you use things in a way that wasn't intended, what usually ends up happening is that you write a bunch of weird code and go: "really? Is this right?" - and so it is here.
There's nothing particularly wrong with doing what you're doing, given that you said the indices aren't going to go much beyond 200, but, generally, I advise not creating semantic noise by using things for what they aren't. If you must, create an entirely new type that encapsulates exactly this notion of a sparse list. If you must, implement it by using an arraylist under the hood, that'd be fine.
Alternatively, use something that already exists.
a map
From the core library, why not new TreeMap<Integer, T>()? TreeMaps keep themselves sorted, so if you loop through the keys you get them in order: add 5, then 200, then 20, and you get the keys back as '5, 20, 200', as you'd expect. Lookups and insertions are O(log n) rather than the O(1) of an array-backed list, with a higher constant overhead, but this is extremely unlikely to matter one iota with a collection of ~200 items. (Wake me up when you add a million; then the difference might even be measurable, let alone noticeable.) You can also toss a key of a few million at it, no problem. The worst case is far better here; you basically cannot cause this to be a problem, whereas with a sparse, array-backed list, if I tossed an index of 2 billion at it, you would have a few GB worth of blank memory allocated; you'd definitely notice that!
A sparse list
Java itself doesn't have sparse lists, but other libraries do. Search the web for something you like, add it as a dependency, and keep on going.
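As a rough illustration of the "create an entirely new type" and TreeMap suggestions combined, here is a minimal sketch of a sparse store backed by a TreeMap. The class name SparseStore and its methods are hypothetical, chosen only for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Hypothetical wrapper type: maps an int index to a value without
// allocating storage for the gaps between indexes.
final class SparseStore<T> {
    private final TreeMap<Integer, T> items = new TreeMap<>();

    void store(int index, T item) {
        items.put(index, item);
    }

    // Returns null when nothing was stored at this index.
    T get(int index) {
        return items.get(index);
    }

    // Indexes in ascending order, regardless of insertion order.
    List<Integer> indexes() {
        return new ArrayList<>(items.keySet());
    }
}
```

Storing at indexes 5, 200 and 20 and then calling indexes() yields them in ascending order, and storing at a very large index costs one map entry rather than a block of empty slots.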

The loop seems ugly and I would like to use items.setMinimumSize(index + 1) but it does not exist.
A List contains a sequence of references, without gaps (but possibly with null elements). Indexes into a list correlate with that sequence, and the list size is defined as the number of elements it contains. You cannot manipulate elements with indexes greater than or equal to the list's current size.
The size is not to be confused with the capacity of an ArrayList, which is the number of elements it presently is able to accommodate without acquiring additional storage. To a first approximation, you can and should ignore ArrayList capacity. Working with that makes your code specific to ArrayList, rather than general to Lists, and it's mostly an issue of optimization.
Thus, if you want to increase the size of an ArrayList (or most other kinds of List) so as to ensure that a certain index is valid for it, then the only alternative is to add elements to make it at least that large. But you don't need to add them one at a time in a loop. I actually like your
items.addAll(Arrays.asList(new Object[index - items.size()]))
, but you need one more element. The size needs to be at least index + 1 in order to set() the element at index index.
Alternatively, you could use
items.addAll(Collections.nCopies(1 + index - items.size(), null));
That might even be cheaper, too.
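Putting the pieces together, a minimal sketch of the store method might look like the following. ListPadding is a hypothetical name, and the guard against a non-positive count is an addition not in the snippet above: Collections.nCopies throws IllegalArgumentException for a negative count, which would happen whenever index is already well inside the list.

```java
import java.util.Collections;
import java.util.List;

final class ListPadding {
    // Hypothetical helper: pads the list with nulls until index is valid,
    // then stores the item at that index.
    static <T> void store(List<T> items, int index, T item) {
        int missing = index + 1 - items.size();
        if (missing > 0) {
            // nCopies returns a small immutable list; addAll copies the
            // null references into the target list in one call.
            items.addAll(Collections.<T>nCopies(missing, null));
        }
        items.set(index, item);
    }
}
```

Because it only relies on addAll() and set(), the helper works with any modifiable List, not just ArrayList.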

Java OutOfMemory during sorting

I have a task about building a pyramid from a list of numbers, but there is a problem with one test. In my task I need to sort a list. I use Collections.sort():
Collections.sort(inputNumbers, (o1, o2) -> {
if (o1 != null && o2 != null) {
return o1.compareTo(o2);
} else {
throw new CannotBuildPyramidException("Unable to build a pyramid");
}
});
But this test fails
@Test(expected = CannotBuildPyramidException.class)
public void buildPyramid8() {
// given
List<Integer> input = Collections.nCopies(Integer.MAX_VALUE - 1, 0);
// run
int[][] pyramid = pyramidBuilder.buildPyramid(input);
// assert (exception)
}
with an OutOfMemoryError instead of my own CannotBuildPyramidException (which will be thrown in another method after sorting). I understand that it is because of TimSort in the Collections.sort() method. I tried to use HeapSort, but I couldn't even swap elements because my input list was initialized with Arrays.asList(), and when I use the set() method I get an UnsupportedOperationException. Then I tried to convert my list to a regular ArrayList
ArrayList<Integer> list = new ArrayList<>(inputNumbers);
but I got an OutOfMemoryError again. Editing the tests is not allowed. I don't know what to do about this problem. I'm using Java 8 and IntelliJ IDEA.

Note that the list created by Collections.nCopies(Integer.MAX_VALUE - 1, 0) uses a tiny amount of memory and is immutable. The documentation says "Returns an immutable list consisting of n copies of the specified object. The newly allocated data object is tiny (it contains a single reference to the data object)". And if you look at the implementation, you'll see it does exactly what one would expect from that description. It returns a List object that only pretends to be large, only holding the size and the element once and returning that element when asked about any index.
The problem with Collections.sort is then two-fold:
The list must not be immutable, but that list is. That btw also explains the UnsupportedOperationException you got when you tried to set().
For performance reasons, it "obtains an array containing all elements in this list, sorts the array, [and writes back to the list]". So at this point the tiny pretend-list does get blown up and causes your memory problem.
So you need to find some other way to sort. One that works in-place and doesn't swap anything for this input (which is correct, as the list is already sorted). You could for example use bubble sort, which takes O(n) time and O(1) space on this input and doesn't attempt any swaps here.
Btw, about getting the memory problem "because of TimSort": Timsort is really not to blame. You don't even get to the Timsort part, as it's the preparatory copy-to-array that causes the memory problem. And furthermore, Timsort is smart and would detect that the data is already sorted and then wouldn't do anything. So if you actually did get to the Timsort part, or if you could directly apply it to the list, Timsort wouldn't cause a problem.
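The bubble sort suggestion can be sketched as follows. This is only an illustration under some assumptions: InPlaceSort is a hypothetical name, and IllegalArgumentException stands in for the question's CannotBuildPyramidException, which is not defined here. On an already sorted list the first pass finds nothing to swap, so set() is never called and an immutable list passes through untouched.

```java
import java.util.List;

final class InPlaceSort {
    // Bubble sort with early exit: stops after the first pass that makes no
    // swaps. Uses O(1) extra memory and only calls set() when two
    // neighbouring elements are out of order.
    static void bubbleSort(List<Integer> list) {
        boolean swapped = true;
        for (int end = list.size() - 1; swapped && end > 0; end--) {
            swapped = false;
            for (int i = 0; i < end; i++) {
                Integer a = list.get(i);
                Integer b = list.get(i + 1);
                if (a == null || b == null) {
                    // Stand-in for the question's CannotBuildPyramidException.
                    throw new IllegalArgumentException("null element");
                }
                if (a.compareTo(b) > 0) {
                    list.set(i, b);
                    list.set(i + 1, a);
                    swapped = true;
                }
            }
        }
    }
}
```

On the test's input this still walks about two billion elements once, so it takes a while, but it never allocates an array the size of the list.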

This list is too huge! Collections.nCopies(Integer.MAX_VALUE - 1, 0) gives us a list of 2^31 - 2 elements (2147483646), each one taking about 4 bytes in memory once the list is actually materialized (this is a "simplified" size of an Integer reference). If we multiply it, we'll have about 8.59 GB of memory required to store all those numbers. Are you sure you have enough memory to store it?
I believe this test is written in a very bad manner - one should never try to create such a huge List.
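The arithmetic behind that estimate can be checked with a few lines of Java. This is a back-of-the-envelope sketch: NCopiesMemory and materializedBytes are hypothetical names, 4 bytes per reference is an assumption (it depends on the JVM and its settings), and the figure only applies once the elements are copied into a real array, as Collections.sort() or new ArrayList<>(inputNumbers) would do.

```java
final class NCopiesMemory {
    // Bytes needed for an array holding one reference per element.
    // The cast to long avoids int overflow in the multiplication.
    static long materializedBytes(int elementCount, int bytesPerReference) {
        return (long) elementCount * bytesPerReference;
    }
}
```

For Integer.MAX_VALUE - 1 elements at 4 bytes each this gives 8,589,934,584 bytes, which is the roughly 8.59 GB quoted above.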

Spark RDD, how to generate JavaRDD of length N?

(Part of the problem is that the docs say "undocumented" for parallelize, which leaves me reading books for examples that don't always apply.)
I am trying to create an RDD of length N = 10^6 by executing N operations of a Java class we have. I can have that class implement Serializable or any Function if necessary. I don't have a fixed-length dataset up front; I am trying to create one. I'm trying to figure out whether to create a dummy array of length N to parallelize, or to pass it a function that runs N times.
I'm not sure which approach is valid or better. In Spark, if I start out with a well-defined dataset like the words in a document, the count of those words is already defined and I just parallelize some map or filter to operate on that data.
In my case I think it's different: I'm trying to parallelize the creation of an RDD that will contain 10^6 elements...
DESCRIPTION:
In Java 8 using Spark 1.5.1, we have a Java method doDrop() that takes a PipeLinkageData and returns a DropResult.
I am thinking I could use map() or flatMap() to call a one to many function, I was trying to do something like this in another question that never quite worked:
JavaRDD<DropResult> simCountRDD = spark.parallelize(makeRange(1,getSimCount())).map(new Function<Integer, DropResult>()
{
public DropResult call(Integer i) {
return pld.doDrop();
}
});
Thinking something like this is more the correct approach?
// pld is of type PipeLinkageData, it's already initialized
// parallelize wants a collection passed into first param
List<PipeLinkageData> pldListofOne = new ArrayList<>();
// make an ArrayList of one
pldListofOne.add(pld);
int howMany = 1000000;
JavaRDD<DropResult> nSizedRDD = spark.parallelize(pldListofOne).flatMap(new FlatMapFunction<PipeLinkageData, DropResult>()
{
public Iterable<DropResult> call(PipeLinkageData pld) {
List<DropResult> returnList = new ArrayList<>();
// is Spark good at spreading a for loop like this?
for ( int i = 0; i < howMany ; i++ ){
returnList.add(pld.doDrop());
}
// EDIT changed from returnRDD to returnList
return returnList;
}
});
One other concern: is a JavaRDD correct here? I can see needing to call FlatMapFunction, but I don't need a FlatMappedRDD? And since I am never trying to flatten a group of arrays or lists into a single array or list, do I really ever need to flatten anything?

The first approach should work as long as DropResult and PipeLinkageData can be serialized and there are no issues with its internal logic (like depending on shared state).
The second approach in its current form doesn't make sense. A single record will be processed on a single partition. That means the whole process will be completely sequential and can crash if the data doesn't fit in a single worker's memory. Increasing the number of elements should solve the problem, but it doesn't improve on the first approach.
Finally, you can initialize an empty RDD and then use mapPartitions, replacing FlatMapFunction with an almost identical MapPartitionsFunction, and generate the required number of objects per partition.
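For the per-partition variant, each partition needs to know how many objects to generate so that the totals add up to N. The following plain-Java sketch shows one way to work that out; it does not call Spark at all, and PartitionCounts and split are hypothetical names. In a Spark job, each count could become one input record so that every partition runs its own loop of doDrop() calls.

```java
final class PartitionCounts {
    // Splits `total` into `partitions` counts that differ by at most one
    // and sum to exactly `total`.
    static int[] split(int total, int partitions) {
        if (partitions <= 0 || total < 0) {
            throw new IllegalArgumentException("need partitions > 0 and total >= 0");
        }
        int[] counts = new int[partitions];
        int base = total / partitions;
        int remainder = total % partitions;
        for (int p = 0; p < partitions; p++) {
            // The first `remainder` partitions take one extra element.
            counts[p] = base + (p < remainder ? 1 : 0);
        }
        return counts;
    }
}
```

With N = 1,000,000 and 8 partitions, every partition would generate 125,000 results, which avoids the single-record, single-partition bottleneck of the second approach.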

What's more efficient and compact: A huge set of linkedlist variables or a two-dimensional arraylist containing each of these?

I want to create a large matrix (n by n) where each element corresponds to a LinkedList (of certain objects).
I can either
Create the n*n individual linked lists and name them with the help of eval() within a loop that iterates through both dimensions (or something similar), so that in the end I'll have LinkedList_1_1, LinkedList_1_2 etc. Each one has a unique variable name. Basically, skipping the matrix altogether.
Create an ArrayList of ArrayLists and then push into each element a linked list.
Please recommend a method if I want to conserve time and space and keep access easy in my later code, when I want to reference individual LinkedLists. Ease of access will be poor with Method 1, as I'll have to use eval whenever I want to access a particular linked list.
My gut-feeling tells me Method 2 is the best approach, but how exactly do I form my initializations?

As you know the sizes to start with, why don't you just use an array? Unfortunately Java generics prevents the array element itself from being a concrete generic type, but you can use a wildcard:
LinkedList<?>[][] lists = new LinkedList<?>[n][n];
Or slightly more efficient in memory, just a single array:
LinkedList<?>[] lists = new LinkedList<?>[n * n];
// Then for access...
lists[y * n + x] = ...;
Then you'd need to cast on each access - using @SuppressWarnings given that you know it will always work (assuming you encapsulate it appropriately). I'd put that in a single place:
@SuppressWarnings("unchecked")
private LinkedList<Foo> getList(int x, int y) {
if (lists[y][x] == null) {
lists[y][x] = new LinkedList<Foo>();
}
// Cast won't actually have any effect at execution time. It's
// just to tell the compiler we know what we're doing.
return (LinkedList<Foo>) lists[y][x];
}
Of course in both cases you'd then need to populate the arrays with empty linked lists if you needed to. (If several of the linked lists never end up having any nodes, you may wish to consider only populating them lazily.)
I would certainly not generate a class with hundreds of variables. It would make programmatic access to the lists very painful, and basically be a bad idea in any number of ways.
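A minimal sketch that combines the single-array layout, the encapsulated cast, and lazy population might look like this. ListGrid is a hypothetical name, and the bounds check is an addition so that an out-of-range x cannot silently land in a neighbouring row.

```java
import java.util.LinkedList;

// Hypothetical n-by-n grid of linked lists stored in one flat array.
final class ListGrid<T> {
    private final int n;
    private final LinkedList<?>[] cells;

    ListGrid(int n) {
        this.n = n;
        this.cells = new LinkedList<?>[n * n];
    }

    // Returns the list at (x, y), creating it on first access.
    @SuppressWarnings("unchecked")
    LinkedList<T> get(int x, int y) {
        if (x < 0 || x >= n || y < 0 || y >= n) {
            throw new IndexOutOfBoundsException("(" + x + ", " + y + ")");
        }
        int index = y * n + x;
        if (cells[index] == null) {
            cells[index] = new LinkedList<T>();
        }
        // Safe because only LinkedList<T> instances are ever stored above.
        return (LinkedList<T>) cells[index];
    }
}
```

Callers then write grid.get(x, y).add(value) and never see the wildcard array or the cast.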

How can I create a java list using the member variables of an existing list, without using a for loop?

I have a java list
List<myClass> myList = myClass.selectFromDB("where clause");
//myClass.selectFromDB returns a list of objects from DB
But I want a different list, specifically.
List<Integer> goodList = new ArrayList<Integer>();
for (int i = 0; i < myList.size(); i++) {
    goodList.add(myList.get(i).getGoodInteger());
}
Yes, I could do a different query from the DB in the initial myList creation, but assume for now I must use that as the starting point and no other DB queries. Can I replace the for loop with something much more efficient?
Thank you very much for any input, apologies for my ignorance.

In order to extract a field from the "myclass", you're going to have to loop through the entire contents of the list. Whether you do that with a for loop, or use some sort of construct that hides the for loop from you, it's still going to take approximately the same time and use the same amount of resources.
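One example of a construct that hides the loop, assuming Java 8 or later is available, is the Stream API. The sketch below uses hypothetical names (DbRow stands in for the question's class, GoodInts for wherever the helper would live); it still visits every element once, so it is shorter to write, not faster.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical stand-in for the question's class.
final class DbRow {
    private final int goodInteger;

    DbRow(int goodInteger) {
        this.goodInteger = goodInteger;
    }

    Integer getGoodInteger() {
        return goodInteger;
    }
}

final class GoodInts {
    // Maps each row to its integer; the iteration happens inside the stream.
    static List<Integer> extract(List<DbRow> rows) {
        return rows.stream()
                   .map(DbRow::getGoodInteger)
                   .collect(Collectors.toList());
    }
}
```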

An important question is: why do you want to do this? Are you trying to make your code cleaner? If so, you could write a method along these lines:
public static List<Integer> extractGoodInts (List<myClass> myList) {
List<Integer> goodInts = new ArrayList<Integer>();
for(int i = 0; i < myList.size(); i++){
goodInts.add(myList.get(i).getGoodInteger());
}
return goodInts;
}
Then, in your code, you can just go:
List<myClass> myList = myClass.selectFromDB("where clause");
List<Integer> goodInts = myClass.extractGoodInts(myList);
However, if you're trying to make your code more efficient and you're not allowed to change the query, you're out of luck; somehow or another, you're going to need to individually grab each int from the list, which means you're going to be running in O(n) time no matter what clever tricks you can come up with.

There are really only two ways I can think of that you can make this more "efficient":
Somehow split this up between multiple cores so you can do the work in parallel. Of course, this assumes that you've got other cores, they aren't doing anything useful already, and that there's enough processing going on that the overhead of doing this is even worth it. My guess is that (at least) the last point isn't true in your case given that you're just calling a getter. If you wanted to do this you'd try to have a number of threads (I'd probably actually use an Executor and Futures for this) equal to the number of cores, and then give roughly equal amounts of work to each of them (probably just by slicing your list into roughly equal-sized pieces).
If you believe that you'll only be accessing a small subset of the resulting List, but are unsure of exactly which elements, you could try doing things lazily. The easiest way to do that would be to use a pre-built lazy mapping List implementation. There's one in Guava (formerly the Google Collections Library). You use it by calling Lists.transform(). It'll immediately return a List, but it'll only perform your transformation on elements as they are requested. Again, this is only more efficient if it turns out that you only ever look at a small fraction of the output List. If you end up looking at the entire thing this will not be more efficient, and will probably work out to be less efficient.
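To make the lazy option concrete, here is a minimal standard-library sketch of a lazily mapped list view. It is not Guava's Lists.transform() implementation; LazyView and transform are hypothetical names, and java.util.function.Function assumes Java 8 or later. The mapping function runs each time an element is read, and never for elements that are not read.

```java
import java.util.AbstractList;
import java.util.List;
import java.util.function.Function;

final class LazyView {
    // Returns a read-only view of `source` whose elements are computed on
    // access. Nothing is copied or transformed up front.
    static <F, T> List<T> transform(final List<F> source,
                                    final Function<? super F, ? extends T> fn) {
        return new AbstractList<T>() {
            @Override
            public T get(int index) {
                return fn.apply(source.get(index));
            }

            @Override
            public int size() {
                return source.size();
            }
        };
    }
}
```

Results are not cached, so reading the same index twice applies the function twice; that is the trade-off behind the warning that a full traversal can end up less efficient than an eager copy.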

Not sure what you mean by efficient. As the others said, you have to call the getGoodInteger method on every element of that list one way or another. About the best you can do is avoid checking the size every time:
List<Integer> goodInts = new ArrayList<Integer>();
for (myClass myObj : myList) {
goodInts.add(myObj.getGoodInteger());
}
I also second jboxer's suggestion of making a function for this purpose.
