Edit: Already solved using RDD.collectAsMap()
I am trying to replicate the solution to the problem from pages 28-30 of http://on-demand.gputechconf.com/gtc/2016/presentation/S6424-michela-taufer-apache-spark.pdf
I have a HashMap that I instantiate outside of the map function. The HashMap contains the following data:
{1:2, 2:3, 3:2, 4:2, 5:3}
A previously defined RDD, previousRDD, has the type:
JavaPairRDD<Integer, Iterable<Tuple2<Integer, Integer>>>
and holds the data:
1: [(1,2), (1,5)]
2: [(2,1), (2,3), (2,5)]
3: [(3,2), (3,4)]
4: [(4,3), (4,5)]
5: [(5,1), (5,2), (5,4)]
I try to create a new RDD with a flatMapToPair:
JavaPairRDD<Integer, Integer> newRDD = previousRDD.flatMapToPair(new PairFlatMapFunction<Tuple2<Integer, Iterable<Tuple2<Integer, Integer>>>, Integer, Integer>() {
@Override
public Iterator<Tuple2<Integer, Integer>> call(Tuple2<Integer, Iterable<Tuple2<Integer, Integer>>> integerIterableTuple2) throws Exception {
Integer count;
ArrayList<Tuple2<Integer, Integer>> list = new ArrayList<>();
count = hashMap.get(integerIterableTuple2._1);
for (Tuple2<Integer, Integer> t : integerIterableTuple2._2) {
Integer tcount = hashMap.get(t._2);
if (count < tcount || (count.equals(tcount) && integerIterableTuple2._1 < t._2)) {
list.add(t);
}
}
return list.iterator();
}
});
But the hashMap.get(t._2) call inside the for loop returns null most of the time, even though I have checked that the HashMap holds the proper values.
Is there a way to properly get the values of a HashMap inside a Spark function?
It should work. Spark should capture your variable, serialize it, and send it to each worker with each task. You might try broadcasting this map
sc.broadcast(hashMap)
and reading it inside the function through the returned Broadcast object's value() method instead of using hashMap directly. It is more memory-efficient too (shared storage per executor).
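Before changing how the map is shipped to the executors, it can help to pin down what the filter should return. The following plain-Java sketch involves no Spark at all; DegreeFilterCheck, filter and sample are made-up names for illustration. It applies the same condition as the question's flatMapToPair call to the sample data on a single machine.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class DegreeFilterCheck {
    // Keeps (v, u) when v has the smaller count, breaking ties by the smaller id.
    // This mirrors the condition used in the question; it is not Spark code.
    static List<String> filter(Map<Integer, Integer> counts, Map<Integer, int[]> neighbours) {
        List<String> kept = new ArrayList<>();
        for (Map.Entry<Integer, int[]> e : neighbours.entrySet()) {
            int v = e.getKey();
            int count = counts.get(v);
            for (int u : e.getValue()) {
                int tcount = counts.get(u);
                if (count < tcount || (count == tcount && v < u)) {
                    kept.add("(" + v + "," + u + ")");
                }
            }
        }
        return kept;
    }

    // Sample data copied from the question.
    static List<String> sample() {
        Map<Integer, Integer> counts = new LinkedHashMap<>();
        counts.put(1, 2);
        counts.put(2, 3);
        counts.put(3, 2);
        counts.put(4, 2);
        counts.put(5, 3);

        Map<Integer, int[]> neighbours = new LinkedHashMap<>();
        neighbours.put(1, new int[] {2, 5});
        neighbours.put(2, new int[] {1, 3, 5});
        neighbours.put(3, new int[] {2, 4});
        neighbours.put(4, new int[] {3, 5});
        neighbours.put(5, new int[] {1, 2, 4});
        return filter(counts, neighbours);
    }
}
```

On the sample data this keeps (1,2), (1,5), (2,5), (3,2), (3,4) and (4,5), so each undirected edge survives exactly once. If the Spark job returns something else, the map seen inside the function is the likely difference.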
I had a similar problem with class variables. You can try making your variable local, or declaring an extra local variable that refers to it, like this:
Map localMap = hashMap;
JavaPairRDD<Integer, Integer> newRDD = previousRDD.flatMapToPair(
...
Integer tcount = localMap.get(t._2);
...
);
I think this is due to Spark's serialization mechanism. You can read more about it here.
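The difference between a field and a local variable can be seen with plain Java serialization, without Spark. This is only a sketch: ClosureCaptureDemo and SerializableFunction are hypothetical names, and the example does not reproduce the null lookups from the question. It only shows that a lambda reading a field captures the whole enclosing object, while a lambda reading a local variable captures just that variable.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.HashMap;
import java.util.function.Function;

class ClosureCaptureDemo {
    // Hypothetical stand-in for a function interface that must be serializable.
    interface SerializableFunction<T, R> extends Function<T, R>, Serializable {}

    // The enclosing class is deliberately NOT Serializable.
    private final HashMap<Integer, Integer> hashMap = new HashMap<>();

    ClosureCaptureDemo() {
        hashMap.put(1, 2);
        hashMap.put(2, 3);
    }

    // Reads the field, so the lambda captures `this` (the whole object).
    SerializableFunction<Integer, Integer> usingField() {
        return key -> hashMap.get(key);
    }

    // Reads a local variable, so the lambda captures only the map.
    SerializableFunction<Integer, Integer> usingLocalCopy() {
        HashMap<Integer, Integer> localMap = hashMap;
        return key -> localMap.get(key);
    }

    // Returns true if Java serialization of the object succeeds.
    static boolean canSerialize(Object o) {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(o);
            return true;
        } catch (NotSerializableException e) {
            return false;
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }
}
```

Here canSerialize(demo.usingField()) fails because the non-serializable enclosing object is dragged along, while canSerialize(demo.usingLocalCopy()) succeeds. Whether this is what happened in the question depends on where hashMap was declared, which the question does not show.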
Related
I have a class with two methods: the startAPI() calls the API classes to extract entities and returns the entities and the occurrence of the entities. I need this return value in two different methods from another class, but as soon as I call the second method (countApiOcc()) the map I pass is empty. How can I use the returned map in two different methods?
public class Topic {
public void calculateNoFeedback(String language, List<String> api, List<String> corr) {
Map<String, Object> apis = startAPI(api, textList);
CountTopics countT = new CountTopics();
ArrayList<String> topics = countT.getTopics(apis);
countT.countApiOcc(topics, apis);
}
public Map<String, Object> startAPI(List<String> selectedAPI, List<String> text) {
Map<String, Object> apisValues = new HashMap<String, Object>();
//do stuff to extract the entities and return entities
return apisValues;
}
}
The CountTopics class looks as follows. In short, the user can select which and how many APIs to use for extracting entities. In CountTopics, getTopics() should find the topics that every selected API found, and countApiOcc() needs the frequency of the selected entities. All of this works; the only problem is the map I need in the second method.
public ArrayList<String> getTopics(Map<String, Object> apiV) {
System.out.println("apiV: "+apiV);
Iterator iterator = apiV.entrySet().iterator();
mapSize = apiV.size();
System.out.println("Size of the map: "+ mapSize);
while (iterator.hasNext()) {
Map.Entry entries = (Map.Entry)iterator.next();
String key = entries.getKey().toString();
switch(key) {
case "valuesMS":
Map<String, Object> mapMicrosoft = (Map<String, Object>) apiV.get(key);
ArrayList<String> microsoft = (ArrayList<String>) mapMicrosoft.get("topicArrayMS");
microsoftTopicLowerCase.addAll(microsoft);
topicsMultiset.addAll(microsoft);
break;
case "valuesGate":
Map<String, Object> mapGate = (Map<String, Object>) apiV.get(key);
ArrayList<String> gate = (ArrayList<String>) mapGate.get("topicArrayGA");
//store the values for finding the topics which are found from every selected API
//store the values from the api to lower case to find the index later (needed for how often this api found the topic
gateTopicLowerCase.addAll(gate);
topicsMultiset.addAll(gate);
break;
}
iterator.remove();
}
//rest code: compare the Arrays to find the same topics
iterator.remove();
There's your culprit. You're emptying your map. Don't do this, and it will work. From my limited view on your code, there doesn't seem to be any reason to modify your map. But in case it would be necessary, you should make a copy of the map at the beginning of your method, and work on this copy. Generally it's a bad idea to modify your input parameters, unless that is the specific purpose of that method.
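A minimal sketch of the copy approach, assuming the method really does need to remove entries while it iterates. MapCopyDemo, visitAndRemove and visitCopy are illustrative names, not part of the original code.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

class MapCopyDemo {
    // Mimics getTopics() from the question: visits every entry and removes it,
    // so the map passed in is empty afterwards.
    static int visitAndRemove(Map<String, Object> apiV) {
        int visited = 0;
        Iterator<Map.Entry<String, Object>> iterator = apiV.entrySet().iterator();
        while (iterator.hasNext()) {
            iterator.next();
            visited++;
            iterator.remove();
        }
        return visited;
    }

    // Same traversal on a shallow copy: the caller's map keeps its entries.
    static int visitCopy(Map<String, Object> apiV) {
        return visitAndRemove(new HashMap<>(apiV));
    }
}
```

The copy is shallow, so nested maps and lists are still shared with the caller; that is enough here because only the top-level entries are removed.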
This is my Java code. It takes a lot of time to process and return a response when the data is huge. How can I optimize it to run faster?
List<EventLog> eventLogs = new ArrayList<EventLog>();
List<EventLog> eventLogData = get(currentUser, data);
Map<String, Integer> map = new HashMap<String, Integer>();
for (EventLog rep : eventLogData) {
if (map.containsKey(rep.getEventType())) {
map.put(rep.getEventType(), map.get(rep.getEventType()) + 1);
} else {
map.put(rep.getEventType(), 1);
}
}
for (Map.Entry<String, Integer> entry : map.entrySet()) {
EventLog list = new EventLog();
list.setEventType(entry.getKey());
list.setCount(entry.getValue());
eventLogs.add(list);
}
return eventLogs;
Have you considered using an alternative high performance collections library?
Here is my implementation, using the Trove library. It executes much faster because:
The map update requires a single access (adjustOrPutValue) instead of the two in the original implementation (containsKey and get).
It handles primitive ints instead of Integers, which avoids all the boxing and unboxing operations and consumes less memory.
List<EventLog> eventLogData = get(currentUser, data);
TObjectIntMap<String> map = new TObjectIntHashMap<>();
for (EventLog rep : eventLogData) {
map.adjustOrPutValue(rep.getEventType(), 1, 1);
}
List<EventLog> eventLogs = new ArrayList<>();
map.forEachEntry(new TObjectIntProcedure<String>() {
@Override
public boolean execute(String key, int value) {
EventLog list = new EventLog();
list.setEventType(key);
list.setCount(value);
eventLogs.add(list);
return true;
}
});
return eventLogs;
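If adding a library is not an option, the JDK's own Map.merge (Java 8 and later) also updates the count in a single call, although it still boxes the values. This is a different technique from the Trove one above. A minimal sketch, assuming the event types have already been pulled out of the EventLog objects into a list of strings; EventTypeCounter and countByType are illustrative names.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class EventTypeCounter {
    // Counts how often each event type occurs using one map call per element.
    static Map<String, Integer> countByType(List<String> eventTypes) {
        Map<String, Integer> counts = new HashMap<>();
        for (String type : eventTypes) {
            // Inserts 1 for a new key, otherwise adds 1 to the existing value.
            counts.merge(type, 1, Integer::sum);
        }
        return counts;
    }
}
```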
Do the work in a single loop. You can use an indexed for loop instead of a foreach loop, taking the size of the list from list.size():
for(int a=0, b=5 ; a<=5 ; a++,b--){
// do your stuff here
}
I have a map that is filled over time. The problem is that I want to know which entry was added last; so far I have only found how to get the last entry in the map. Is there a way to get the last added entry?
Code so far:
int spawned = 0;
NavigableMap<String, Integer> minioncounter = new TreeMap<String, Integer>();
while (spawned < 7) {
if(!minioncounter.containsKey("big")){
minioncounter.put("big", 1);
}else if(!minioncounter.containsKey("small")){
minioncounter.put("small", 1);
}else if(minioncounter.containsKey("small") && minioncounter.get("small") < 2){
minioncounter.put("small", 2);
}else if(!minioncounter.containsKey("archer")){
minioncounter.put("archer", 1);
}else{
minioncounter.put("archer", minioncounter.get("archer")+1);
}
spawned++;
System.out.println(minioncounter.);
System.out.println(minioncounter);
}
Current console output:
{big=1}
{big=1, small=1}
{big=1, small=2}
{archer=1, big=1, small=2}
{archer=2, big=1, small=2}
{archer=3, big=1, small=2}
{archer=4, big=1, small=2}
The order in which it is already stated is the one I have to use later on.
See LinkedHashMap.
This Map implementation maintains keys in the order in which they were inserted (basically). That said, this may not meet your specific needs, I'd read the documentation.
It's simple enough to extend an existing implementation to provide even more control, though.
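A small sketch of that behaviour, using the keys from the question (InsertionOrderDemo, lastInserted and sample are illustrative names). Note that, by default, updating the value of a key that is already present does not move it, so "last added" here means the last new key rather than the last put call.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class InsertionOrderDemo {
    // Walks the entries in insertion order and returns the final one,
    // or null if the map is empty.
    static <K, V> Map.Entry<K, V> lastInserted(LinkedHashMap<K, V> map) {
        Map.Entry<K, V> last = null;
        for (Map.Entry<K, V> entry : map.entrySet()) {
            last = entry;
        }
        return last;
    }

    // Replays the first few puts from the question's loop.
    static LinkedHashMap<String, Integer> sample() {
        LinkedHashMap<String, Integer> minioncounter = new LinkedHashMap<>();
        minioncounter.put("big", 1);
        minioncounter.put("small", 1);
        minioncounter.put("small", 2); // update: position of "small" is unchanged
        minioncounter.put("archer", 1);
        return minioncounter;
    }
}
```

With this data the map iterates as big, small, archer, and lastInserted returns the archer entry. If the most recently updated key is needed instead, a wrapper that records the key on every put is the safer choice.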
You can create your own StoreLastAddMap class that wraps the real NavigableMap. Your class exposes a put method that updates the reference to the last added key before calling the wrapped NavigableMap's put method.
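import java.util.NavigableMap;
import java.util.TreeMap;

public class StoreLastAddMap {
    private final NavigableMap<String, Integer> minioncounter = new TreeMap<String, Integer>();
    private String lastAddedKey;

    // Remember the key, then delegate to the wrapped map.
    public Integer put(String key, Integer val) {
        lastAddedKey = key;
        return minioncounter.put(key, val);
    }

    // Getter for the wrapped map, for other Map-related operations.
    public NavigableMap<String, Integer> getMap() { return minioncounter; }

    // Returns null if nothing has been added yet.
    public Integer getLastAddedVal() {
        return lastAddedKey == null ? null : minioncounter.get(lastAddedKey);
    }

    public String getLastAddedKey() { return lastAddedKey; }
}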
Or something to that effect.
The problem I have is an example of something I've seen often. I have a series of strings (one string per line, let's say) as input, and all I need to do is return how many times each string has appeared. What is the most elegant way to solve this, without using a trie or other string-specific structure? The solution I've used in the past has been to use a hashtable-esque collection of custom-made (String, integer) objects that implement Comparable to keep track of how many times each string has appeared, but this method seems clunky for several reasons:
1) This method requires the creation of a comparison function that is identical to String.compareTo().
2) The impression that I get is that I'm misusing TreeSet, which has been my collection of choice. Updating the counter for a given string requires checking to see if the object is in the set, removing the object, updating the object, and then reinserting it. This seems wrong.
Is there a more clever way to solve this problem? Perhaps there is a better Collections interface I could use to solve this problem?
Thanks.
One possibility:
public class Counter {
public int count = 1;
}
public void count(String[] values) {
Map<String, Counter> stringMap = new HashMap<String, Counter>();
for (String value : values) {
Counter count = stringMap.get(value);
if (count != null) {
count.count++;
} else {
stringMap.put(value, new Counter());
}
}
}
This way you still need a map, but you don't have to replace the entry every time a string is seen again: you look up the Counter object, which is a mutable wrapper around an int, and increase its value by one, so each repeated string costs a single map lookup.
TreeMap is much better for this problem, or better yet, Guava's Multiset.
To use a TreeMap, you'd use something like
Map<String, Integer> map = new TreeMap<>();
for (String word : words) {
Integer count = map.get(word);
if (count == null) {
map.put(word, 1);
} else {
map.put(word, count + 1);
}
}
// print out each word and each count:
for (Map.Entry<String, Integer> entry : map.entrySet()) {
System.out.printf("Word: %s Count: %d%n", entry.getKey(), entry.getValue());
}
Integer theCount = map.get("the");
if (theCount == null) {
theCount = 0;
}
System.out.println(theCount); // number of times "the" appeared, or 0 if it never appeared
Multiset would be much simpler than that; you'd just write
Multiset<String> multiset = TreeMultiset.create();
for (String word : words) {
multiset.add(word);
}
for (Multiset.Entry<String> entry : multiset.entrySet()) {
System.out.printf("Word: %s Count: %d%n", entry.getElement(), entry.getCount());
}
System.out.println(multiset.count("the")); // number of times "the" appeared
You can use a hash-map (no need to "create a comparable function"):
Map<String,Integer> count(String[] strings)
{
Map<String,Integer> map = new HashMap<String,Integer>();
for (String key : strings)
{
Integer value = map.get(key);
if (value == null)
map.put(key,1);
else
map.put(key,value+1);
}
return map;
}
Here is how you can use this method in order to print (for example) the string-count of your input:
Map<String,Integer> map = count(input);
for (String key : map.keySet())
System.out.println(key+" "+map.get(key));
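On Java 8 or later, the same count can also be written with the Streams API. This is a sketch of an alternative, not the method above; StreamCount is an illustrative name, and a TreeMap is used only so that the keys come out sorted.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

class StreamCount {
    // Groups equal strings together and counts the members of each group.
    static Map<String, Long> count(String[] strings) {
        return Arrays.stream(strings)
                .collect(Collectors.groupingBy(
                        Function.identity(), // the string itself is the key
                        TreeMap::new,        // sorted result map
                        Collectors.counting()));
    }
}
```

The counts come back as Long rather than Integer because that is what Collectors.counting() produces.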
You can use a Bag data structure from the Apache Commons Collections library, such as HashBag.
A Bag does exactly what you need: It keeps track of how often an element got added to the collections.
HashBag<String> bag = new HashBag<>();
bag.add("foo");
bag.add("foo");
bag.getCount("foo"); // 2
I want to store all values of a certain variable in a dataset and the frequency for each of these values. To do so, I use an ArrayList<String> to store the values and an ArrayList<Integer> to store the frequencies (since I can't use int). The number of different values is unknown, that's why I use ArrayList and not Array.
Example (simplified) dataset:
a,b,c,d,b,d,a,c,b
The ArrayList<String> with values looks like: {a,b,c,d} and the ArrayList<Integer> with frequencies looks like: {2,3,2,2}.
To fill these ArrayLists I iterate over each record in the dataset, using the following code.
public void addObservation(String obs){
if(values.size() == 0){// first value
values.add(obs);
frequencies.add(new Integer(1));
return;//added
}else{
for(int i = 0; i<values.size();i++){
if(values.get(i).equals(obs)){
frequencies.set(i, new Integer((int)frequencies.get(i)+1));
return;//added
}
}
// only gets here if value of obs is not found
values.add(obs);
frequencies.add(new Integer(1));
}
}
However, since the datasets I will use this for can be very big, I want to optimize my code, and using frequencies.set(i, new Integer((int)frequencies.get(i)+1)); does not seem very efficient.
That brings me to my question; how can I optimize the updating of the Integer values in the ArrayList?
Use a HashMap<String,Integer>
Create the HashMap like so
HashMap<String,Integer> hm = new HashMap<String,Integer>();
Then your addObservation method will look like
public void addObservation(String obs) {
if( hm.containsKey(obs) )
hm.put( obs, hm.get(obs)+1 );
else
hm.put( obs, 1 );
}
I would use a HashMap or a Hashtable as tskzzy suggested. Depending on your needs I would also create an object that has the name, count as well as other metadata that you might need.
So the code would be something like:
Hashtable<String, FrequencyStatistics> statHash = new Hashtable<String, FrequencyStatistics>();
for (String value : values) {
if (statHash.get(value) == null) {
FrequencyStatistics newStat = new FrequencyStatistics(value);
statHash.put(value, newStat);
} else {
statHash.get(value).incrementCount();
}
}
Now, your FrequencyStatistics object's constructor would automatically set its initial count to 1, while the incrementCount() method would increment the count and perform any other statistical calculations that you might require. This should also be more extensible in the future than storing a hash of the String with only its corresponding Integer.
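The answer does not show FrequencyStatistics itself. A minimal sketch, assuming it only needs to track the value and its count; the getter names and the FrequencyTally helper are assumptions added for illustration, and a HashMap is used in place of Hashtable since no synchronization is needed here.

```java
import java.util.HashMap;
import java.util.Map;

class FrequencyStatistics {
    private final String name;
    private int count = 1; // a value is only created once it has been seen

    FrequencyStatistics(String name) {
        this.name = name;
    }

    void incrementCount() {
        count++;
        // any other per-value statistics could be updated here
    }

    String getName() { return name; }

    int getCount() { return count; }
}

class FrequencyTally {
    // Builds one FrequencyStatistics object per distinct value.
    static Map<String, FrequencyStatistics> tally(Iterable<String> values) {
        Map<String, FrequencyStatistics> statHash = new HashMap<>();
        for (String value : values) {
            FrequencyStatistics stat = statHash.get(value);
            if (stat == null) {
                statHash.put(value, new FrequencyStatistics(value));
            } else {
                stat.incrementCount();
            }
        }
        return statHash;
    }
}
```

For the example dataset a,b,c,d,b,d,a,c,b this yields counts of 2, 3, 2 and 2 for a, b, c and d, matching the frequencies listed in the question.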