I'm running some experiments over a large dataset and would like to optimize a particular part. Currently, I have 5-6 Models, each of which stores a mapping from Topics to Lists of Strings. The set of Topics is large and the same between each Model, so there must be a better way. Ultimately the query I need to perform is: what is the String in position x of the List for some Model-Topic combination?
One of the problems with the nested-map approach is scale: if there are, say, 500k-5M topics, each with a list of 20 strings, then my Map<Model, Map<Topic, List<String>>> is going to be massive.
Have you tried SortedSets / Maps? It sounds like you need to optimize your search: sorted collections (like TreeMap) give O(log n) lookups, while searching a regular list is O(n). Of course, this kind of thing is something at which databases excel...
It's not clear where/how you want to achieve "memory efficiency". First, look at the particulars of your actual data to see how much storage it consumes, then examine various ways of organizing it and analyze their efficiency in terms of % overhead vs. your "real" data.
A brief glance shows that a HashMap, when you consider the associated tables, has about 80 bytes of overhead per entry. An ArrayList looks to average out around 10-12. Without looking, I would guess that a TreeMap would be more than a HashMap -- maybe 100.
Generally speaking, links within your own objects will be "cheaper", both in storage and speed to access, than links using these aggregating objects. But the aggregating objects are convenient to use, and have been "optimized" to a degree.
(But looking at your update, you probably should be looking at a DB application, rather than holding everything in heap.)
You could use Topic and Model to construct a composite key in a single Map, e.g.
map.put(topic1_id + model1_id, list1_1);
map.put(topic1_id + model2_id, list1_2);
...
map.get(topic_id + model_id)
where the IDs are Strings (or a similar scheme could be used with numeric identifiers).
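A minimal sketch of that composite-key idea (class and method names here are illustrative, not from the question). A delimiter that never appears in the ids avoids ambiguous keys such as "ab" + "c" vs. "a" + "bc":
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class CompositeKeyIndex {
    private final Map<String, List<String>> byKey = new HashMap<>();

    private static String key(String topicId, String modelId) {
        return topicId + "|" + modelId; // assumes ids never contain '|'
    }

    void put(String topicId, String modelId, List<String> strings) {
        byKey.put(key(topicId, modelId), strings);
    }

    // Returns the String at position x, or null if the combination is unknown.
    String query(String topicId, String modelId, int x) {
        List<String> list = byKey.get(key(topicId, modelId));
        return (list == null) ? null : list.get(x);
    }
}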
A similar approach is to assign each topic and model a unique number, then store the lists of strings in arrays, so looking up the list for a given combination is a matter of looking up two indexes, then accessing a given location in a 2D array. (however, this is easier when you know the number of topics and models in advance of constructing the data structure)
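A rough sketch of that numbered-index variant, assuming all model and topic ids are known up front (names are illustrative):
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class ArrayIndex {
    private final Map<String, Integer> modelIndex = new HashMap<>(); // model id -> row
    private final Map<String, Integer> topicIndex = new HashMap<>(); // topic id -> column
    private final String[][][] data;                                 // [model][topic][position]

    ArrayIndex(List<String> modelIds, List<String> topicIds, int listLength) {
        for (int i = 0; i < modelIds.size(); i++) modelIndex.put(modelIds.get(i), i);
        for (int i = 0; i < topicIds.size(); i++) topicIndex.put(topicIds.get(i), i);
        data = new String[modelIds.size()][topicIds.size()][listLength];
    }

    void put(String modelId, String topicId, int x, String value) {
        data[modelIndex.get(modelId)][topicIndex.get(topicId)][x] = value;
    }

    // Two index lookups, then a direct array access.
    String query(String modelId, String topicId, int x) {
        return data[modelIndex.get(modelId)][topicIndex.get(topicId)][x];
    }
}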
For memory efficiency, also consider the small details. In general, you want to minimise the number of objects, since each object carries an overhead. ArrayLists can have a lot of wasted space because they grow dynamically, expanding by roughly 50% whenever they exceed their current capacity. If you can pre-size them to the required capacity (or use an array instead) then you can save a lot of memory. The same applies when using large numbers of small HashMaps.
One possible data structure is a hierarchy of maps, leading to an array of Strings. E.g.:
HashMap<Model, HashMap<Topic, String[]>> map;
A query function would then look like:
public String query(Model model, Topic topic, int x) {
    HashMap<Topic, String[]> childMap = map.get(model);
    if (childMap == null) {
        return null;
    }
    String[] list = childMap.get(topic);
    if (list == null) {
        return null;
    }
    return list[x];
}
Presuming your Model and Topic structures implement hashCode() and equals() reasonably, the query performance should be quite good.
One potential weakness: I'm assuming you need to index a large number of Model/Topic combinations, and related lists of Strings (if not, you presumably wouldn't be asking about optimization). My guess is that the child String[] arrays will consume a large amount of memory. Each array is a Java object (about 20 bytes) + a pointer at each array location.
2 suggestions there:
1) If many Model/Topic combinations share the same set of Strings, you could gain quite a lot by sharing those String[] instances (see the sketch after these suggestions).
2) If you're using a 64-bit VM, be sure to use compressed ordinary object pointers (-XX:+UseCompressedOops). That will at least keep most of the pointers to 4 bytes instead of 8. Compressed OOPs is the default since 1.6.0_23, so a relatively recent VM will save you some memory here.
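For suggestion 1, a minimal sketch of sharing String[] instances, assuming the arrays are never mutated after they are built (names are illustrative):
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class StringArrayPool {
    // Arrays.asList gives a key with value-based equals/hashCode over the elements.
    private final Map<List<String>, String[]> canonical = new HashMap<>();

    // Returns a previously seen array with the same contents, or registers this one.
    String[] canonicalize(String[] strings) {
        return canonical.computeIfAbsent(Arrays.asList(strings), k -> strings);
    }
}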
One other possibility not mentioned is to store the strings in a String[][][] and the models and topics in Lists (such as ArrayList), and then at query time:
public String query(Model model, Topic topic, int x) {
    return strings[models.indexOf(model)][topics.indexOf(topic)][x];
}
It could be further improved for speed if the topics and models were kept sorted; then binary search, rather than indexOf, could be used, as in the sketch below.
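A hedged sketch of that improvement, assuming models and topics are kept in sorted Lists and that Model and Topic implement Comparable (otherwise pass a Comparator to binarySearch):
import java.util.Collections;
import java.util.List;

public String query(List<Model> models, List<Topic> topics, String[][][] strings,
                    Model model, Topic topic, int x) {
    int m = Collections.binarySearch(models, model); // O(log n) instead of indexOf's O(n)
    int t = Collections.binarySearch(topics, topic);
    if (m < 0 || t < 0) {
        return null; // unknown model or topic
    }
    return strings[m][t][x];
}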
I have a JavaPairRDD<Integer, Integer[]> on which I want to perform a groupByKey action.
The groupByKey action gives me a:
org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle
which is practically an OutOfMemory error, if I am not mistaken. This occurs only in big datasets (in my case when "Shuffle Write" shown in the Web UI is ~96GB).
I have set:
spark.serializer org.apache.spark.serializer.KryoSerializer
in $SPARK_HOME/conf/spark-defaults.conf, but I am not sure if Kryo is used to serialize my JavaPairRDD.
Is there something else that I should do to use Kryo, apart from setting this conf parameter, to serialize my RDD? I can see in the serialization instructions that:
Spark automatically includes Kryo serializers for the many commonly-used core Scala classes covered in the AllScalaRegistrar from the Twitter chill library.
and that:
Since Spark 2.0.0, we internally use Kryo serializer when shuffling RDDs with simple types, arrays of simple types, or string type.
I also noticed that when I set spark.serializer to be Kryo, the Shuffle Write in the Web UI increases from ~96GB (with default serializer) to 243GB!
EDIT: In a comment, I was asked about the logic of my program, in case groupByKey can be replaced with reduceByKey. I don't think it's possible, but here it is anyway:
Input has the form:
key: index bucket id,
value: Integer array of entity ids in this bucket
The shuffle write operation produces pairs in the form:
key: entityId,
value: Integer array of all entity ids in the same bucket (call them neighbors)
The groupByKey operation gathers all the neighbor arrays of each entity, some possibly appearing more than once (in many buckets).
After the groupByKey operation, I keep a weight for each bucket (based on the number of negative entity ids it contains) and for each neighbor id I sum up the weights of the buckets it belongs to.
I normalize the scores of each neighbor id with another value (let's say it's given) and emit the top-3 neighbors per entity.
The number of distinct keys that I get is around 10 million (around 5 million positive entity ids and 5 million negatives).
EDIT2: I tried using Hadoop's Writables (VIntWritable and VIntArrayWritable extending ArrayWritable) instead of Integer and Integer[], respectively, but the shuffle size was still bigger than the default JavaSerializer.
Then I increased spark.shuffle.memoryFraction from 0.2 to 0.4 (even though it is deprecated in version 2.1.0, there is no description of what should be used instead) and enabled off-heap memory, and the shuffle size was reduced by ~20GB. Even though this does what the title asks, I would prefer a more algorithmic solution, or one that includes better compression.
Short Answer: Use fastutil and maybe increase spark.shuffle.memoryFraction.
More details:
The problem with this RDD is that Java needs to store object references, which consume much more space than primitive types. In this example, I need to store Integers instead of int values. A Java Integer takes 16 bytes, while a primitive Java int takes 4 bytes. Scala's Int type, on the other hand, is a 32-bit (4-byte) type, just like Java's int, which is why people using Scala may not have faced something similar.
Apart from increasing spark.shuffle.memoryFraction to 0.4, another nice solution was to use the fastutil library, as suggested in Spark's tuning documentation:
The first way to reduce memory consumption is to avoid the Java features that add overhead, such as pointer-based data structures and wrapper objects. There are several ways to do this: Design your data structures to prefer arrays of objects, and primitive types, instead of the standard Java or Scala collection classes (e.g. HashMap). The fastutil library provides convenient collection classes for primitive types that are compatible with the Java standard library.
This enables storing each element of the int arrays in my pair RDD as a primitive int (i.e., using 4 bytes instead of 16 for each element of the array). In my case, I used IntArrayList instead of Integer[]. This made the shuffle size drop significantly and allowed my program to run in the cluster. I also used this library in other parts of the code, where I was making some temporary Map structures. Overall, by increasing spark.shuffle.memoryFraction to 0.4 and using the fastutil library, the shuffle size dropped from 96GB to 50GB (!) using the default Java serializer (not Kryo).
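For illustration, a minimal sketch of what that swap looks like with fastutil (method and variable names here are made up, not from my actual job):
import it.unimi.dsi.fastutil.ints.IntArrayList;

// Build the neighbour list with primitive ints instead of boxed Integers.
static IntArrayList buildNeighbours(int[] bucketEntityIds) {
    IntArrayList neighbours = new IntArrayList(bucketEntityIds.length); // pre-sized
    for (int id : bucketEntityIds) {
        neighbours.add(id); // stored as a 4-byte int, no Integer wrapper
    }
    return neighbours;
}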
Alternative: I have also tried sorting each int array of an RDD pair and storing the deltas using Hadoop's VIntArrayWritable type (smaller numbers use less space than bigger numbers), but this also required registering VIntWritable and VIntArrayWritable with Kryo, which didn't save any space after all. In general, I think that Kryo only makes things faster and does not decrease the space needed, but I am still not sure about that.
I am not marking this answer as accepted yet, because someone else might have a better idea, and because I didn't use Kryo after all, as my OP was asking. I hope reading it will help someone else with the same issue. I will update this answer if I manage to further reduce the shuffle size.
I'm still not really sure what you want to do. However, the fact that you use groupByKey and say that there is no way to do it with reduceByKey makes me more confused.
I think you have rdd = (Integer, Integer[]) and you want something like (Integer, Iterable[Integer[]]) that's why you are using groupByKey.
Anyway, I am not really familiar with Java in Spark, but in Scala I would use reduceByKey, which combines values on the map side before the shuffle:
rdd.mapValues(Iterable(_)).reduceByKey(_ ++ _)
Basically, you want to convert each value to an Iterable of arrays and then concatenate the Iterables together.
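A rough Java equivalent of that Scala one-liner might look like the sketch below. This is only my guess at the translation, and whether it actually shrinks the shuffle depends on how much map-side combining is possible for your data:
import java.util.ArrayList;
import java.util.List;
import org.apache.spark.api.java.JavaPairRDD;

// Wrap each array in a singleton list, then merge lists per key.
static JavaPairRDD<Integer, List<Integer[]>> combine(JavaPairRDD<Integer, Integer[]> rdd) {
    return rdd
        .mapValues(arr -> {
            List<Integer[]> one = new ArrayList<>();
            one.add(arr);
            return one;
        })
        .reduceByKey((a, b) -> {
            List<Integer[]> merged = new ArrayList<>(a.size() + b.size());
            merged.addAll(a);
            merged.addAll(b);
            return merged;
        });
}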
I think the best approach that can be recommended here (without more specific knowledge of the input data) in general is to use the persist API on your input RDD.
As step one, I'd try to call .persist(MEMORY_ONLY_SER) on the input RDD to lower memory usage (albeit at a certain CPU overhead, which shouldn't be much of a problem for ints in your case).
If that is not sufficient, you can try .persist(MEMORY_AND_DISK_SER). If your shuffle still takes so much memory that the input dataset needs to be made even easier on memory, .persist(DISK_ONLY) may be an option, but one that will strongly degrade performance.
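A minimal sketch of what that looks like with the Java API (the wrapper method is just for illustration):
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.storage.StorageLevel;

// Cache the input in serialized form to cut its heap footprint.
static void persistInput(JavaPairRDD<Integer, Integer[]> input) {
    input.persist(StorageLevel.MEMORY_ONLY_SER());
    // Fallbacks, roughly in order of decreasing speed (an RDD can only have one level set):
    // input.persist(StorageLevel.MEMORY_AND_DISK_SER());
    // input.persist(StorageLevel.DISK_ONLY());
}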
I am working on a project where I am processing a lot of tweets; the goal is to remove duplicates as I process them. I have the tweet IDs, which come in as strings of the format "166471306949304320"
I have been using a HashSet<String> for this, which works fine for a while. But by the time I get to around 10 million items I am drastically bogged down and eventually get a GC error, presumably from the rehashing. I tried defining a better size/load with
tweetids = new HashSet<String>(220000,0.80F);
and that lets it get a little farther, but it is still excruciatingly slow (by around 10 million it is taking 3x as long to process). How can I optimize this? Given that I have an approximate idea of how many items should be in the set by the end (in this case, around 20-22 million), should I create a HashSet that rehashes only two or three times, or would the overhead for such a set incur too many time penalties? Would things work better if I wasn't using a String, or if I defined a different hashCode function (which, in this case of a particular instance of a String, I'm not sure how to do)? This portion of the implementation code is below.
tweetids = new HashSet<String>(220000,0.80F); // in constructor
duplicates = 0;
...
// In loop: For(each tweet)
String twid = (String) tweet_twitter_data.get("id");
// Check that we have not processed this tweet already
if (!(tweetids.add(twid))) {
    duplicates++;
    continue;
}
SOLUTION
Thanks to your recommendations, I solved it. The problem was the amount of memory required for the hash representations; first, HashSet<String> was simply enormous and uncalled for, because keeping a full String (and its hash) per id is exorbitant at this scale. Next I tried a Trie, but it crashed at just over 1 million entries; reallocating the arrays was problematic. I used a HashSet<Long> to better effect and almost made it, but speed decayed and it finally crashed on the last leg of the processing (around 19 million). The solution came with departing from the standard library and using Trove. It finished 22 million records a few minutes faster than not checking duplicates at all. The final implementation was simple, and looked like this:
import gnu.trove.set.hash.TLongHashSet;
...
TLongHashSet tweetids; // class variable
...
tweetids = new TLongHashSet(23000000,0.80F); // in constructor
...
// inside for(each record)
String twid = (String) tweet_twitter_data.get("id");
if (!(tweetids.add(Long.parseLong(twid)))) {
    duplicates++;
    continue;
}
You may want to look beyond the Java collections framework. I've done some memory intensive processing and you will face several problems
The number of buckets for large hash maps and hash sets is going to cause a lot of overhead (memory). You can influence this by using some kind of custom hash function and a modulo of e.g. 50000.
Strings are represented using 16-bit characters in Java. You can halve that by using UTF-8 encoded byte arrays for most scripts.
HashMaps are in general quite wasteful data structures, and HashSets are basically just a thin wrapper around them.
Given that, take a look at Trove or Guava for alternatives. Also, your ids look like longs. Those are 64 bits (8 bytes), quite a bit smaller than the string representation.
An alternative you might want to consider is using Bloom filters (Guava has a decent implementation). A Bloom filter would tell you if something is definitely not in a set, and with reasonable certainty (less than 100%) if something is contained. That, combined with some disk-based solution (e.g. database, MapDB, memcached, ...), should work reasonably well. You could buffer up incoming new ids, write them in batches, and use the Bloom filter to check if you need to look in the database, thus avoiding expensive lookups most of the time.
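As an illustration only, a sketch with Guava's BloomFilter, sized for roughly the 22 million ids mentioned in the question and an assumed 1% false-positive rate:
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

class DuplicatePreFilter {
    // Expected insertions and false-positive rate are assumptions, tune to taste.
    private final BloomFilter<Long> seen =
            BloomFilter.create(Funnels.longFunnel(), 22_000_000, 0.01);

    // true means the id is definitely new; false only means "maybe seen",
    // so a false result still needs a check against the disk-based store.
    boolean definitelyNew(long tweetId) {
        if (seen.mightContain(tweetId)) {
            return false;
        }
        seen.put(tweetId);
        return true;
    }
}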
If you are just looking for the existence of Strings, then I would suggest you try using a Trie (also called a prefix tree). The total space used by a Trie should be less than that of a HashSet, and it's quicker for string lookups.
The main disadvantage is that it can be slower when used from a hard disk, as it is loading a tree rather than a linearly stored structure like a hash table. So make sure that it can be held inside RAM.
The link I gave is a good list of pros/cons of this approach.
As an aside, the Bloom filters suggested by Jilles van Gurp make great fast pre-filters.
Simple, untried and possibly stupid suggestion: Create a Map of Sets, indexed by the first/last N characters of the tweet ID:
Map<String, Set<String>> sets = new HashMap<String, Set<String>>();
String tweetId = "166471306949304320";
sets.put(tweetId.substring(0, 5), new HashSet<String>());
sets.get(tweetId.substring(0, 5)).add(tweetId);
assert(sets.containsKey(tweetId.substring(0, 5)) && sets.get(tweetId.substring(0, 5)).contains(tweetId));
That easily lets you keep the maximum size of the hashing space(s) below a reasonable value.
I am building a cache that has to store as much data as possible. CPU is not a major issue, because the next level of data is a lot more expensive to reach than running the CPUs a little bit for decompression.
I'm looking for a good strategy, not a full implementation. A typical object instance that should be cached can be generalized as a list of hashmaps. The keys in these maps are very similar to the keys in the other maps in that list. Keys and values are strings.
Maps in different caching objects (this means also different lists) may not always have similar keys. Maybe only a subset (50%) of the keys is the same.
I was thinking of extracting the keys into ONE header array and each hashmap's collection of values into another array of the same length. This means the data array might be sparse (null pointers), but I don't have to carry the metadata around. The position in the data array is the only way of looking up the correct key.
Now I want to compress the data array. Compression won't really work well on a single data array because there is little information. It will need a few data arrays stuck together to get a good compression rate.
Is there any good way of compressing String arrays in Java? How many of these data arrays should I cluster together for good results?
Is there maybe some better approach? This is an open question for collecting ideas, so please feel free to elaborate :-)
Flyweight can help
If you are not compressing, you can use the Flyweight pattern to avoid the cost of the string key being repeated in each object.
Remember a String is an object, so a key in your hashmap is a reference to it. If a lot of objects with the same property use references to the same String object, you only pay 4 bytes (with compressed oops) for each reference and keep only one String in memory.
How to ensure you are sharing the string objects between objects? You can use something similar to String.intern(). But please don't use String.intern() itself.
Interning a string means returning the same string object for the same string value. You must hold a cache for those strings. The reason I don't recommend String.intern() is that its cache is held by the String class itself, so it never gets freed. But you can implement something analogous.
This code returns your own string if it's new. And returns the first one if it's not.
HashMap<String, String> internedStrings = new HashMap<String, String>();

synchronized String returnUniqueString(String str) {
    String alreadyCached = internedStrings.get(str);
    if (alreadyCached == null) {
        internedStrings.put(str, str);
        alreadyCached = str;
    }
    return alreadyCached;
}
But if you are compressing, it won't help
Compressing means you are serializing your object graph, and each property name will be serialized as a separate string, repeating itself. Maybe the compressed size doesn't grow too much because it's a repeated string, but when you rehydrate the objects they will be created separately.
Maybe you can use returnUniqueString at the time of rehydrating :)
This sounds like a good approach.
However, I suggest you consider a different way of breaking the map values into lists: rather than making a list for each map, make a list for each different key, containing the values for that key for each item.
For example, if your maps are:
1: {
colour: red,
size: small,
},
2: {
colour: blue,
flavour: strawberry
},
3: {
colour: red,
size: large,
flavour: strawberry
}
Then you decompose into:
colour: [red, blue, red]
size: [small, null, large]
flavour: [null, strawberry, strawberry]
This may seem a bit weird, but the point is that you cluster values of the same type together, which will help the compression be more efficient.
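A minimal sketch of that decomposition in Java, assuming string keys and values as in the question (the method name is illustrative):
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// One array per distinct key, with one slot per map (null where a map has no value).
static Map<String, String[]> decompose(List<Map<String, String>> maps) {
    Set<String> keys = new LinkedHashSet<>();      // union of keys, the shared "header"
    for (Map<String, String> m : maps) {
        keys.addAll(m.keySet());
    }
    Map<String, String[]> columns = new HashMap<>();
    for (String key : keys) {
        String[] column = new String[maps.size()];
        for (int i = 0; i < maps.size(); i++) {
            column[i] = maps.get(i).get(key);      // null if absent
        }
        columns.put(key, column);                  // e.g. colour -> [red, blue, red]
    }
    return columns;
}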
Possible Duplicate:
What exactly are hashtables?
I understand the purpose of using hash functions to securely store passwords. I have used arrays and arraylists for class projects for sorting and searching data. What I am having trouble understanding is the practical value of hashtables for something like sorting and searching.
I got a lecture on hashtables but we never had to use them in school, so it hasn't clicked. Can someone give me a practical example of a task a hashtable is useful for that couldn't be done with a numerical array or arraylist? Also, a very simple low level example of a hash function would be helpful.
There are all sorts of collections out there. Collections are used for storing and retrieving things, so one of the most important properties of a collection is how fast these operations are. To estimate "fastness", people in computer science use big-O notation, which roughly says how many individual operations you have to perform to invoke a certain method (be it get or set, for example). So, for example, getting an element of an ArrayList by index takes exactly one operation; this is O(1). If you have a LinkedList of length n and you need to get something from the middle, you'll have to traverse from the start of the list to the middle, taking n/2 operations; in this case get has complexity O(n). The same goes for key-value stores such as hashtables. There are implementations that give you complexity of O(log n) to get a value by its key, whereas a hashtable copes in O(1). Basically this means that getting a value from a hashtable by its key is really cheap.
Basically, hashtables have performance characteristics similar to arrays with numerical indices (cheap lookup and cheap appending; hashtables are unordered, and adding to them is cheap partly because of this), but are much more flexible in terms of what the key may be. Given a contiguous chunk of memory and a fixed size per item, you can get the address of the nth item very easily and cheaply. That's thanks to the indices being integers; you can't do that with, say, strings, at least not directly. Hashing allows reducing any object (that implements it) to a number, and then you're back to arrays. You still need to check for hash collisions and resolve them (which incurs mostly a memory overhead, since you need to store the original value), but with a halfway decent implementation this is not much of an issue.
So you can now associate any (hashable) object with any value. This has countless uses (although I have to admit, I can't think of one that's applicable to sorting or searching). You can build caches with small overhead (because checking whether the cache can help in a given case is O(1)), implement a relatively performant object system (several dynamic languages do this), go through a list of (id, value) pairs and accumulate the values for identical ids in any way you like, and many other things.
Very simple. Hashtables are often called "associative arrays." Arrays let you access your data by index. Hash tables let you access your data by any other identifier, e.g. a name. For example:
one is associated with 1
two is associated with 2
So, when you have the word "one" you can find its value 1 using a hashtable where the key is "one" and the value is 1. An array only allows the opposite mapping.
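A tiny runnable example of that mapping (the class name is arbitrary):
import java.util.HashMap;
import java.util.Map;

public class WordLookup {
    public static void main(String[] args) {
        // Look up a value by a name instead of a numeric index.
        Map<String, Integer> numbers = new HashMap<>();
        numbers.put("one", 1);
        numbers.put("two", 2);
        System.out.println(numbers.get("one")); // prints 1, found in O(1) on average
    }
}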
For n data elements:
Hashtables allow O(k) searches (where k usually depends only on the hashing function). This is better than the O(log n) of a binary search (which also requires an O(n log n) sort first; if the data is not sorted you are worse off).
However, on the flip side, hashtables tend to take roughly 3n space.
I have a collection of strings that I want to filter. They'll be in this pattern:
xxx_xxx_xxx_xxx
so always a sequence of letters or numbers separated by three underscores. The max length of each string will be 60 characters. I might have a few million of these in my collection.
What data structure could I use to efficiently do something like this:
Get all strings starts with: "abc_123_456"
Get all strings starts with: "def_999_888"
etc..
For example, I could do this:
List<String> matched = new ArrayList<String>();
for (String it : strings) {
    if (it.startsWith(match)) {
        matched.add(it);
    }
}
but that would take a long time if my collection is on the order of millions of strings, and worse yet if the number of matched strings is also high.
The high-level problem is that I want to answer the following question for an app I'm writing: "which of my friends have recommended product A for product B?". I could store this information in a sql table and run the following statement:
select recommender from recs where username='me' and prodIdA='a' and prodIdB='b';
I'm curious if something custom in java/C/C++ could run faster, using encoded flat strings like I have above:
myusername_prodIdA_prodIdB_recommenderusername
The idea being that you could do a starts-with operation on the whole collection of encoded strings to get your answer.
I know trying to implement a custom solution like this is most likely not usable in a production environment, so some SQL DB would be better. Just curious, though.
Thanks
To do that in Java, you can use a Trie structure.
That being said, I don't think it's a good idea. Dumping "a few million" records into memory won't always work.
That's what databases are for; with the right design and proper indexing you can have very good performance with the DB alone.
I think you are looking for a SortedMap.
"headMap(K toKey)
Returns a view of the portion of this map whose keys are strictly less than toKey."
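For a prefix query specifically, a range view of a sorted set (or map) does the trick. A sketch, assuming the ids never contain the character '\uffff' (which holds for letters, digits and underscores):
import java.util.NavigableSet;
import java.util.SortedSet;
import java.util.TreeSet;

class PrefixIndex {
    private final NavigableSet<String> strings = new TreeSet<>();

    void add(String s) {
        strings.add(s);
    }

    // All strings >= prefix and < prefix + '\uffff' start with the prefix.
    // Locating the range is O(log n); iterating it is proportional to the matches.
    SortedSet<String> startingWith(String prefix) {
        return strings.subSet(prefix, prefix + Character.MAX_VALUE);
    }
}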
I know trying to implement a custom solution like this is most likely not usable in a production environment, so some sql db would be better, just curious though
If only for the sake of curiosity, you can put all existing distinct "myusername_prodIdA_prodIdB" combinations in a hashtable, and for each combination store a list of relevant results.
So the structure would look like Map<String, List<String>> and be used like hash.get("def_999_888"). Constant time (O(1)).
You can get rid of inner list and optimize it in many ways, but this is the idea.
The first thing that comes to mind for me is pre-processing the strings into some sort of data structure so that they could be searched for efficiently. If you're going to be calling the search function many times, I think it'd be good for you to put all of the strings into a hash table for a constant-time look up. It'd take more processing power to construct your array of strings, but it'd trivialize the task of searching for them.