How to reset Iterator on MapReduce Function in Apache Spark - java

I'm a newbie with Apache Spark. I want to know how to reset the pointer to the Iterator in a MapReduce function in Apache Spark, so I wrote
Iterator<Tuple2<String,Set<String>>> iter = arg0;
but it isn't working. The following is a class implementing the MapReduce functions in Java.
class CountCandidates implements Serializable,
        PairFlatMapFunction<Iterator<Tuple2<String, Set<String>>>, Set<String>, Integer>,
        Function2<Integer, Integer, Integer> {

    private List<Set<String>> currentCandidatesSet;

    public CountCandidates(final List<Set<String>> currentCandidatesSet) {
        this.currentCandidatesSet = currentCandidatesSet;
    }

    @Override
    public Iterable<Tuple2<Set<String>, Integer>> call(
            Iterator<Tuple2<String, Set<String>>> arg0) throws Exception {
        List<Tuple2<Set<String>, Integer>> resultList =
                new LinkedList<Tuple2<Set<String>, Integer>>();
        for (Set<String> currCandidates : currentCandidatesSet) {
            Iterator<Tuple2<String, Set<String>>> iter = arg0;
            while (iter.hasNext()) {
                Set<String> events = iter.next()._2;
                if (events.containsAll(currCandidates)) {
                    Tuple2<Set<String>, Integer> t =
                            new Tuple2<Set<String>, Integer>(currCandidates, 1);
                    resultList.add(t);
                }
            }
        }
        return resultList;
    }

    @Override
    public Integer call(Integer arg0, Integer arg1) throws Exception {
        return arg0 + arg1;
    }
}
If the iterator cannot be reset inside the function, how can I iterate over the parameter arg0 several times? I already tried a different approach, shown below, but it is also not working: 'resultList' ends up with more data than I expected.
while (arg0.hasNext()) {
    Set<String> events = arg0.next()._2;
    for (Set<String> currentCandidates : currentCandidatesSet) {
        if (events.containsAll(currentCandidates)) {
            Tuple2<Set<String>, Integer> t =
                    new Tuple2<Set<String>, Integer>(currentCandidates, 1);
            resultList.add(t);
        }
    }
}
How can I solve it?
Thanks in advance for your answer, and sorry for my poor English. If you don't understand my question, please leave a comment.

An Iterator can't be 'reset' in plain Java or Scala, even. That's the nature of an Iterator. An Iterable is something that can provide you Iterators many times. Your code needs to be rewritten to accept an Iterable, if that's what you really want to do.
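If you do need to walk the same partition more than once inside call, a common workaround is to drain the Iterator into a local collection first and then loop over that collection as often as you like. A minimal sketch of the idea, reusing the currentCandidatesSet field from the question (note that it buffers the whole partition in memory, so it only makes sense if a partition fits comfortably on the heap):
@Override
public Iterable<Tuple2<Set<String>, Integer>> call(
        Iterator<Tuple2<String, Set<String>>> arg0) throws Exception {
    // Drain the iterator once; the resulting list can be traversed many times.
    List<Tuple2<String, Set<String>>> partition = new ArrayList<Tuple2<String, Set<String>>>();
    while (arg0.hasNext()) {
        partition.add(arg0.next());
    }
    List<Tuple2<Set<String>, Integer>> resultList =
            new LinkedList<Tuple2<Set<String>, Integer>>();
    for (Set<String> currCandidates : currentCandidatesSet) {
        for (Tuple2<String, Set<String>> tuple : partition) {
            if (tuple._2.containsAll(currCandidates)) {
                resultList.add(new Tuple2<Set<String>, Integer>(currCandidates, 1));
            }
        }
    }
    return resultList;
}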

The Hadoop iterator could theoretically be reset to the beginning if it were cloneable. Resetting to the beginning would be acceptable in a MapReduce framework, since you would still read the file from the beginning and get better overall speed. Resetting the iterator to a random point would run counter to the MapReduce mindset, because it would likely require random access into a file.
There is a ticket in Hadoop's JIRA explaining why they chose not to make the iterator cloneable, although it does indicate that it would be possible, since the values would not have to be stored in memory.

Related

Java does not de-compile correctly

I'm developing for Android and compiling with Gradle, pulling the dependency from Git via jitpack.io.
I'm trying to use this library from Git for functional programming:
fj - functional programming for Java 7
I ran the code and got errors even though everything is tested.
The problem is in the class GroupBy:
Source code:
public Collection<Group<S,T>> execute(Collection<T> collection){
    Hashtable<S, Group<S, T>> groups = new Hashtable<S, Group<S, T>>();
    for (T item : collection){
        S classification = grouper.select(item);
        if (!groups.contains(classification)){
            groups.put(classification, new Group<S, T>(classification));
        }
        groups.get(classification).add(item);
    }
    return groups.values();
}
Decompiled code:
public Collection<GroupBy.Group<S, T>> execute(Collection<T> collection) {
    Hashtable groups = new Hashtable();
    Object item;
    Object classification;
    for(Iterator var3 = collection.iterator(); var3.hasNext(); ((GroupBy.Group)groups.get(classification)).add(item)) {
        item = var3.next();
        classification = this.grouper.select(item);
        if(!groups.contains(classification)) {
            groups.put(classification, new GroupBy.Group(classification));
        }
    }
    return groups.values();
}
I would appreciate any help. Currently I don't see any reason why the code looks different.
Thanks
The short answer is that when Java is compiled, information is lost. However, the decompiled code functions exactly the same as the code you wrote.
Let's look at it line by line...
public Collection<GroupBy.Group<S, T>> execute(Collection<T> collection) {
This is the same, though it's given the Group class its full name.
Hashtable groups = new Hashtable();
Object item;
Object classification;
As you can see here, the variable names and all of the generic information are lost. Generics in Java can be thought of as a hint to the compiler to check for errors; once the compiler has finished compiling, the information is (generally) thrown away.
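A quick way to see that erasure in action (a small illustrative snippet, not from the decompiled library):
List<String> strings = new ArrayList<String>();
List<Integer> numbers = new ArrayList<Integer>();
// Both lists share the same runtime class; the type parameters only existed at compile time.
System.out.println(strings.getClass() == numbers.getClass()); // prints true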
for(
    Iterator var3 = collection.iterator();
    var3.hasNext();
    ((GroupBy.Group)groups.get(classification)).add(item)
) {
The enhanced for loop has been replaced by a classic for loop. This is because in bytecode they are the same thing (though a smarter decompiler might have figured this out and written an enhanced for loop here).
The other interesting thing here is that the groups.get(...).add(...) statement has been moved into the increment clause of the for loop. If you think about the contract of for(initialisation; termination; increment), the increment expression runs at the end of every loop iteration, so even though you wrote that statement inside the loop body, the effect is the same. [There's probably a good reason for doing it this way; I'm not a compiler guru, though, so I can't say for certain.]
    item = var3.next();
    classification = this.grouper.select(item);
    if(!groups.contains(classification)) {
        groups.put(classification, new GroupBy.Group(classification));
    }
}
return groups.values();
}
The rest of the code is pretty much exactly what you wrote.
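As an illustration, here is roughly how the compiler treats an enhanced for loop over a collection (a hand-written sketch using a hypothetical names list, not actual decompiler output):
// What you write:
for (String s : names) {
    System.out.println(s);
}
// Roughly what the compiler emits, and what the decompiler then reconstructs:
for (Iterator<String> it = names.iterator(); it.hasNext(); ) {
    String s = it.next();
    System.out.println(s);
}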

Iterator provided by the Hibernate Interceptor post flush method throws ConcurrentModificationException

I have extended the EmptyInterceptor provided by Hibernate to perform some logic on post flush. The overridden postFlush method is given an iterator. When I tried to iterate, I received a ConcurrentModificationException.
Below is my code snippet:
@Override
public void postFlush(Iterator entities) throws CallbackException {
    while (entities.hasNext()) {
        Object entity;
        try {
            entity = entities.next();
        } catch (ConcurrentModificationException e) {
            // I get a ConcurrentModificationException while iterating.
            return;
        }
    }
}
I am getting the exception below:
java.util.ConcurrentModificationException
at java.util.HashMap$HashIterator.nextEntry(HashMap.java:922)
at java.util.HashMap$ValueIterator.next(HashMap.java:950)
at org.hibernate.internal.util.collections.LazyIterator.next(LazyIterator.java:51)
at com.mycompany.MyInterceptor.postFlush(MyInterceptor.java:55)
at org.hibernate.event.internal.AbstractFlushingEventListener.postPostFlush(AbstractFlushingEventListener.java:401)
at org.hibernate.event.internal.DefaultAutoFlushEventListener.onAutoFlush(DefaultAutoFlushEventListener.java:70)
at org.hibernate.internal.SessionImpl.autoFlushIfRequired(SessionImpl.java:1130)
at org.hibernate.internal.SessionImpl.list(SessionImpl.java:1580)
at org.hibernate.internal.CriteriaImpl.list(CriteriaImpl.java:374)
From the Hibernate forum we can understand that the iterator passed to the postFlush() method is not thread safe, which causes the ConcurrentModificationException.
Suggestions and solutions to avoid the exception are appreciated.
If it's a synchronization issue, try using a ConcurrentHashMap instead of a plain HashMap.
See also this answer; I think it might help.
Manually copy it into a List:
@Override
public void postFlush(Iterator entities) {
    super.postFlush(entities);
    List<Object> objects = new ArrayList<>();
    while (entities.hasNext()) {
        objects.add(entities.next());
    }
    // ...
    // now you can use the objects list
}
If you look at the implementation of IteratorUtils.toList, it just does:
List list = new ArrayList(estimatedSize);
while (iterator.hasNext()) {
    list.add(iterator.next());
}
which isn't any faster than doing it yourself, except... perhaps, by allocating the list with an estimated size of 10, it is faster because it doesn't necessarily have to re-allocate...
Copying the iterator into a list via org.apache.commons.collections.IteratorUtils before iterating worked for me:
@Override
public void preFlush(Iterator entities) {
    List list = IteratorUtils.toList(entities);
    for (Object o : list) { ... }
}
However, I can't explain why it works when using IteratorUtils...

Collections.sort isn't sorting

I am building a web application using Java EE (although my problem is more Java-based).
In a servlet, I am getting a list of orders from the EJB. Each order has a list of states (sent, on dock, not received, ...).
I want to sort this list of states by the date of the state, so I use Collections.sort like this:
for (Command c : commands) {
    c.getStateList().sort(new Comparator<State>() {
        @Override
        public int compare(State o1, State o2) {
            return o1.getStateDate().compareTo(o2.getStateDate());
        }
    });
    c.getStateList().sort(Collections.reverseOrder());
}
request.setAttribute("commands", commands);
But when I display the results, the states are not sorted.
I tried to reverse the order as you can see, but it isn't working either.
As you can also see, I replaced the Collections.sort with the ListIWantToSort.sort. Still not working.
Any ideas on why it does not work or how I could repair it?
EDIT: Here is the getter for the list and its instantiation:
@OneToMany(cascade = CascadeType.ALL, mappedBy = "ciiCommande")
private List<Etat> etatList;

@XmlTransient
public List<Etat> getEtatList() {
    return etatList;
}

List<Commande> commandes = new ArrayList<Commande>();
And I get my commands with a findAll method.
To display them, I use this:
<c:forEach items="${commandes}" var="cmd">
    <td>${cmd.etatList[0].codeStatut.libelleSituation}</td>
</c:forEach>
You are first sorting the list using the custom comparator. Then you are re-sorting it according to the reversed natural ordering of the elements - not the custom ordering you already applied. So the first sort is not taking effect as the list is re-ordered by the second sort. Note that Collections.reverseOrder() does not reverse the list - it is the reverse of the natural ordering (so the elements in getEtatList() must already be Comparable).
Try losing the second sort and doing:
c.getEtatList().sort(new Comparator<Etat>() {
    @Override
    public int compare(Etat o1, Etat o2) {
        // Note o2/o1 reversed.
        return o2.getDateEtat().compareTo(o1.getDateEtat());
    }
});
Try:
for (Commande c : commandes) {
    c.getEtatList().sort(Collections.reverseOrder(new Comparator<Etat>() {
        @Override
        public int compare(Etat o1, Etat o2) {
            return o1.getDateEtat().compareTo(o2.getDateEtat());
        }
    }));
}
Since the sort method you're using was added to the List interface in Java SE 8, I guess you're using Java SE 8. Then you can rewrite it as follows:
commandes.forEach(c ->
    c.getEtatList().sort(Comparator.comparing(Etat::getDateEtat).reversed())
);
This should be what you need:
Comparator<Etat> comparator = new Comparator<Etat>() {
    @Override
    public int compare(Etat o1, Etat o2) {
        return o1.getDateEtat().compareTo(o2.getDateEtat());
    }
};

for (Commande c : commandes) {
    Collections.sort(c.getEtatList(), comparator);
    // or this one: Collections.sort(c.getEtatList(), Collections.reverseOrder(comparator));
}
This works as expected; your problem is somewhere else:
public static void main(String[] args) {
    List<State> states = Arrays.asList(new State(2015, 1, 1),
            new State(2014, 1, 1),
            new State(2016, 1, 1));
    System.out.println(states); // not ordered
    states.sort(new Comparator<State>() {
        @Override
        public int compare(State o1, State o2) {
            return o1.getStateDate().compareTo(o2.getStateDate());
        }
    });
    System.out.println(states); // ordered
}

public static class State {
    private final LocalDate stateDate;

    public State(int year, int month, int day) {
        this.stateDate = LocalDate.of(year, month, day);
    }

    public LocalDate getStateDate() { return stateDate; }

    @Override public String toString() { return stateDate.toString(); }
}
Note that you seem to be using Java 8 and your comparator can be written:
states.sort(comparing(State::getStateDate));
After days of struggle, I managed to find a solution.
The list isn't sorted after any of my attempts; I still don't know why.
But I found an annotation, @OrderBy, that sorts the list the way I want (a sketch is shown below).
Thank you all for your help; maybe one day this problem will be sorted out (see the pun? I am so funny).
Cheers
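For reference, applied to the mapping from the question, the @OrderBy fix would look something like the following. This is only a sketch: the property name dateEtat is assumed from the getDateEtat() getter, and DESC is assumed because the question reverses the order so the latest state comes first.
@OneToMany(cascade = CascadeType.ALL, mappedBy = "ciiCommande")
@OrderBy("dateEtat DESC")   // let JPA return the states newest-first
private List<Etat> etatList;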
I appreciate your question, as I have just experienced this. I implemented Comparable (as I have done many other times) on my JPA entity class. When doing a Collections.sort on myMainJPA_Object.getMyList(), the overridden compareTo method does not get invoked.
My work-around has been to create a new List as an ArrayList (for example), do an .addAll(myObject.getMyList()), then do Collections.sort on that new list, and then the sort works (my compareTo method is invoked by the sort). For example:
List<ObjectsToSort> tempList = new ArrayList<>();
tempList.addAll(jpaEntity.getListOfStuff());
Collections.sort(tempList);
//Then you could set the list again
jpaEntity.setListOfStuff(tempList);
I really don't like this solution, but I don't know any other way around it, and hadn't found anything about this problem (until your post). I liked your @OrderBy annotation suggestion; in my case, though, I need to re-sort on a different method call, so this solution works for me.

TreeSet Comparator

I have a TreeSet and a custom comparator.
I get the values from the server according to the changes in the stock.
For example: if time=0 then the server will send all the entries in the stock (unsorted);
if time=200 then the server will send the entries added or deleted after time 200 (unsorted).
On the client side I am sorting the entries. My question is: which is more efficient,
1. fetch all the entries first and then call the addAll method, or
2. add them one by one?
There can be millions of entries.
/////////updated///////////////////////////////////
private static Map<Integer, KeywordInfo> hashMap = new HashMap<Integer, KeywordInfo>();
private static Set<Integer> sortedSet = new TreeSet<Integer>(comparator);

private static final Comparator<Integer> comparator = new Comparator<Integer>() {
    public int compare(Integer o1, Integer o2) {
        int integerCompareValue = o1.compareTo(o2);
        if (integerCompareValue == 0) return integerCompareValue;
        KeywordInfo k1 = hashMap.get(o1);
        KeywordInfo k2 = hashMap.get(o2);
        if (null == k1.getKeyword()) {
            if (null == k2.getKeyword())
                return integerCompareValue;
            else
                return -1;
        } else {
            if (null == k2.getKeyword())
                return 1;
            else {
                int compareString = AlphaNumericCmp.COMPARATOR.compare(
                        k1.getKeyword().toLowerCase(), k2.getKeyword().toLowerCase());
                // int compareString = k1.getKeyword().compareTo(k2.getKeyword());
                if (compareString == 0)
                    return integerCompareValue;
                return compareString;
            }
        }
    }
};
Now there is an event handler which gives me an ArrayList of updated entries;
after adding them to my hashMap I am calling
final Map<Integer, KeywordInfo> mapToReturn = new SubMap<Integer, KeywordInfo>(sortedSet, hashMap);
I think your bottleneck is probably more network-related than CPU-related. A bulk operation fetching all the new entries at once would be more network-efficient.
With regard to the CPU, the time required to populate a TreeSet does not differ significantly between multiple add()s and a single addAll(). The reason is that TreeSet relies on AbstractCollection's addAll() (http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/6-b27/java/util/AbstractCollection.java#AbstractCollection.addAll%28java.util.Collection%29), which in turn creates an iterator and calls add() multiple times.
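For reference, the body of that addAll() is essentially just this loop (a simplified sketch of the linked JDK source):
public boolean addAll(Collection<? extends E> c) {
    boolean modified = false;
    for (E e : c) {
        // Delegates to the collection's own add(), once per element.
        if (add(e))
            modified = true;
    }
    return modified;
}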
So, my advice on the CPU side is: choose the way that keeps your code cleaner and more readable. This is probably obtained through addAll().
In general there is less memory overhead when data is stored as it is being loaded. That should be time-efficient too, perhaps with small buffers; memory allocation costs time as well.
However, time both solutions in a separate prototype. You really have to test with huge numbers, as network traffic costs a lot too. That is a bit of Test-Driven Development, and it adds to QA both quantitative statistics and a check on the correctness of the implementation.
The actual implementation is a linked list, so adding one by one will be faster if you do it right, and I don't think that behaviour will change in the near future.
For your problem a stateful Comparator may help.
// snippet - may not work as-is
public class NaturalComparator implements Comparator<Integer> {
    private boolean anarchy = false;
    private Comparator<Integer> parentComparator;

    NaturalComparator(Comparator<Integer> parent) {
        this.parentComparator = parent;
    }

    public void setAnarchy(boolean anarchy) {
        this.anarchy = anarchy;
    }

    public int compare(Integer a, Integer b) {
        if (anarchy) return 1;
        else return parentComparator.compare(a, b);
    }
}
...
NaturalComparator naturalComparator = new NaturalComparator(comparator);
Set<Integer> sortedSet = new TreeSet<Integer>(naturalComparator);
naturalComparator.setAnarchy(true);
sortedSet.addAll(sorted);
naturalComparator.setAnarchy(false);

Lambdas and putIfAbsent

I posted an answer here where the code demonstrating use of the putIfAbsent method of ConcurrentMap read:
ConcurrentMap<String, AtomicLong> map = new ConcurrentHashMap<String, AtomicLong>();

public long addTo(String key, long value) {
    // The final value it became.
    long result = value;
    // Make a new one to put in the map.
    AtomicLong newValue = new AtomicLong(value);
    // Insert my new one or get me the old one.
    AtomicLong oldValue = map.putIfAbsent(key, newValue);
    // Was it already there? Note the deliberate use of '!='.
    if (oldValue != null) {
        // Update it.
        result = oldValue.addAndGet(value);
    }
    return result;
}
The main downside of this approach is that you have to create a new object to put into the map whether it will be used or not. This can have a significant effect if the object is heavy.
It occurred to me that this would be an opportunity to use lambdas. I have not downloaded Java 8, nor will I be able to until it is official (company policy), so I cannot test this, but would something like this be valid and effective?
public long addTo(String key, long value) {
    return map.putIfAbsent(key, () -> new AtomicLong(0)).addAndGet(value);
}
I am hoping to use the lambda to delay the evaluation of the new AtomicLong(0) until it is actually determined that it should be created because it does not exist in the map.
As you can see this is much more succinct and functional.
Essentially I suppose my questions are:
Will this work?
Or have I completely misinterpreted lambdas?
Might something like this work one day?
UPDATE 2015-08-01
The computeIfAbsent method as described below has indeed been added to Java SE 8. The semantics appear to be very close to the pre-release version.
In addition, computeIfAbsent, along with a whole pile of new default methods, has been added to the Map interface. Of course, maps in general can't support atomic updates, but the new methods add considerable convenience to the API.
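For example, with the released Java 8 API the same convenience works even on an ordinary HashMap (a small sketch; unlike ConcurrentHashMap, this is of course not atomic):
Map<String, AtomicLong> plain = new HashMap<>();
// Creates the AtomicLong only if the key is missing, then increments it.
plain.computeIfAbsent("requests", k -> new AtomicLong(0)).addAndGet(1);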
What you're trying to do is quite reasonable, but unfortunately it doesn't work with the current version of ConcurrentMap. An enhancement is on the way, however. The new version of the concurrency library includes ConcurrentHashMapV8 which contains a new method computeIfAbsent. This pretty much allows you to do exactly what you're looking to do. Using this new method, your example could be rewritten as follows:
public long addTo(String key, long value) {
    return map.computeIfAbsent(key, k -> new AtomicLong(0)).addAndGet(value);
}
For further information about the ConcurrentHashMapV8, see Doug Lea's initial announcement thread on the concurrency-interest mailing list. Several messages down the thread is a followup message that shows an example very similar to what you're trying to do. (Note however the old lambda syntax. That message was from August 2011 after all.) And here is recent javadoc for ConcurrentHashMapV8.
This work is intended to be integrated into Java 8, but it hasn't yet as far as I can see. Also, this is still a work in progress, names and specs may change, etc.
AtomicLong is not really a heavy object. For heavier objects I would consider a lazy proxy and give it a lambda that creates the real object only if needed.
class MyObject {
    void doSomething() {}
}

class MyLazyObject extends MyObject {
    Supplier<MyObject> create;
    MyObject instance;

    MyLazyObject(Supplier<MyObject> create) {
        this.create = create;
    }

    MyObject getInstance() {
        if (instance == null)
            instance = create.get();
        return instance;
    }

    @Override
    void doSomething() { getInstance().doSomething(); }
}

public void addTo(String key, long value) {
    map.putIfAbsent(key, new MyLazyObject(() -> new MyObject()));
}
Unfortunately it's not as easy as that. There are two main problems with the approach you've sketched out:
1. The type of the map would need to change from Map<String, AtomicLong> to Map<String, AtomicLongFunction> (where AtomicLongFunction is some function interface that has a single method that takes no arguments and returns an AtomicLong).
2. When you retrieve the element from the map you'd need to apply the function each time to get the AtomicLong out of it. This would result in creating a new instance each time you retrieve it, which is not likely what you wanted.
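A small sketch of that second problem, using java.util.function.Supplier in place of the hypothetical AtomicLongFunction described above:
ConcurrentMap<String, Supplier<AtomicLong>> lazyMap = new ConcurrentHashMap<>();
lazyMap.putIfAbsent("key", () -> new AtomicLong(0));
// Each retrieval applies the function again, producing a brand-new AtomicLong,
// so the increments never accumulate:
long first = lazyMap.get("key").get().incrementAndGet();  // 1
long second = lazyMap.get("key").get().incrementAndGet(); // 1 again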
The idea of having a map that runs a function on demand to fill in missing values is a good one, though, and in fact Google's Guava library has a map that does exactly that; see their MapMaker. That code would also benefit from Java 8 lambda expressions: instead of
ConcurrentMap<Key, Graph> graphs = new MapMaker()
    .concurrencyLevel(4)
    .weakKeys()
    .makeComputingMap(
        new Function<Key, Graph>() {
            public Graph apply(Key key) {
                return createExpensiveGraph(key);
            }
        });
you'd be able to write
ConcurrentMap<Key, Graph> graphs = new MapMaker()
    .concurrencyLevel(4)
    .weakKeys()
    .makeComputingMap((Key key) -> createExpensiveGraph(key));
or
ConcurrentMap<Key, Graph> graphs = new MapMaker()
    .concurrencyLevel(4)
    .weakKeys()
    .makeComputingMap(this::createExpensiveGraph);
Note that using Java 8 ConcurrentHashMap it's completely unnecessary to have AtomicLong values. You can safely use ConcurrentHashMap.merge:
ConcurrentMap<String, Long> map = new ConcurrentHashMap<String, Long>();

public long addTo(String key, long value) {
    return map.merge(key, value, Long::sum);
}
It's much simpler and also significantly faster.
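A quick usage sketch of the merge-based version (hypothetical key and values):
addTo("clicks", 5);               // map now holds {clicks=5}, returns 5
long total = addTo("clicks", 3);  // Long::sum combines 5 and 3, so total is 8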
