I need to store a large dictionary of natural language words -- up to 120,000, depending on the language. These need to be kept in memory as profiling has shown that the algorithm which utilises the array is the time bottleneck in the system. (It's essentially a spellchecking/autocorrect algorithm, though the details don't matter.) On Android devices with 16MB memory, the memory overhead associated with Java Strings is causing us to run out of space. Note that each String has a 38 byte overhead associated with it, which gives up to a 5MB overhead.
At first sight, one option is to substitute char[] for String. (Or even byte[], as UTF-8 is more compact in this case.) But again, the memory overhead is an issue: each Java array has a 32 byte overhead.
One alternative to ArrayList<String>, etc. is to create a class with much the same interface that internally concatenates all the strings into one gigantic string, e.g. represented as a single byte[], and stores offsets into that huge string. Each offset would take up 4 bytes, giving a much more space-efficient solution.
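As a rough sketch of that idea (the class and method names here are illustrative, not from an existing library), the whole dictionary can live in one UTF-8 byte[] plus an int[] of offsets:

import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.List;

// Illustrative sketch of the "one big byte[] plus offsets" idea above.
public final class FlatStringPool {
    private final byte[] data;     // all words, UTF-8 encoded, back to back
    private final int[] offsets;   // offsets[i] = start of word i; offsets[n] = data.length

    public FlatStringPool(List<String> words) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        offsets = new int[words.size() + 1];
        int i = 0;
        for (String w : words) {
            offsets[i++] = buf.size();
            byte[] bytes = w.getBytes(StandardCharsets.UTF_8);
            buf.write(bytes, 0, bytes.length);
        }
        offsets[i] = buf.size();
        data = buf.toByteArray();
    }

    public int size() {
        return offsets.length - 1;
    }

    // Materialises word i as a String only when it is actually needed.
    public String get(int i) {
        return new String(data, offsets[i], offsets[i + 1] - offsets[i], StandardCharsets.UTF_8);
    }
}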
My questions are a) are there any other solutions to the problem with similarly low overheads* and b) is any solution available off-the-shelf? Searching through the Guava, trove and PCJ collection libraries yields nothing.
*I know one can get the overhead down below 4 bytes, but there are diminishing returns.
NB. Support for Compressed Strings being Dropped in HotSpot JVM? suggests that the JVM option -XX:+UseCompressedStrings isn't going to help here.
I had to develop a word dictionary for a class project. We ended up using a trie as the data structure. I'm not sure of the size difference between an ArrayList and a trie, but the performance is a lot better.
Here are some resources that could be helpful.
https://en.wikipedia.org/wiki/Trie
https://www.topcoder.com/community/data-science/data-science-tutorials/using-tries/
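For illustration, here is a minimal trie sketch in Java. This naive HashMap-per-node version only shows the lookup structure; it is not memory-efficient (a compact implementation would use char-indexed arrays or a DAWG):

import java.util.HashMap;
import java.util.Map;

// Minimal trie supporting add() and contains(); not tuned for memory.
public final class Trie {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean isWord;
    }

    private final Node root = new Node();

    public void add(String word) {
        Node node = root;
        for (int i = 0; i < word.length(); i++) {
            node = node.children.computeIfAbsent(word.charAt(i), c -> new Node());
        }
        node.isWord = true;
    }

    public boolean contains(String word) {
        Node node = root;
        for (int i = 0; i < word.length() && node != null; i++) {
            node = node.children.get(word.charAt(i));
        }
        return node != null && node.isWord;
    }
}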
I have a JavaPairRDD<Integer, Integer[]> on which I want to perform a groupByKey action.
The groupByKey action gives me a:
org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle
which is practically an OutOfMemory error, if I am not mistaken. This occurs only in big datasets (in my case when "Shuffle Write" shown in the Web UI is ~96GB).
I have set:
spark.serializer org.apache.spark.serializer.KryoSerializer
in $SPARK_HOME/conf/spark-defaults.conf, but I am not sure if Kryo is used to serialize my JavaPairRDD.
Is there something else that I should do to use Kryo, apart from setting this conf parameter, to serialize my RDD? I can see in the serialization instructions that:
Spark automatically includes Kryo serializers for the many commonly-used core Scala classes covered in the AllScalaRegistrar from the Twitter chill library.
and that:
Since Spark 2.0.0, we internally use Kryo serializer when shuffling RDDs with simple types, arrays of simple types, or string type.
I also noticed that when I set spark.serializer to be Kryo, the Shuffle Write in the Web UI increases from ~96GB (with default serializer) to 243GB!
EDIT: In a comment, I was asked about the logic of my program, in case groupByKey can be replaced with reduceByKey. I don't think it's possible, but here it is anyway:
Input has the form:
key: index bucket id,
value: Integer array of entity ids in this bucket
The shuffle write operation produces pairs in the form:
entityId
Integer array of all entity Ids in the same bucket (call them neighbors)
The groupByKey operation gathers all the neighbor arrays of each entity, some possibly appearing more than once (in many buckets).
After the groupByKey operation, I keep a weight for each bucket (based on the number of negative entity ids it contains) and for each neighbor id I sum up the weights of the buckets it belongs to.
I normalize the scores of each neighbor id with another value (let's say it's given) and emit the top-3 neighbors per entity.
The number of distinct keys that I get is around 10 million (around 5 million positive entity ids and 5 million negatives).
EDIT2: I tried using Hadoop's Writables (VIntWritable and VIntArrayWritable extending ArrayWritable) instead of Integer and Integer[], respectively, but the shuffle size was still bigger than with the default JavaSerializer.
Then I increased spark.shuffle.memoryFraction from 0.2 to 0.4 (even though it is deprecated in version 2.1.0, there is no description of what should be used instead) and enabled off-heap memory, and the shuffle size was reduced by ~20GB. Even though this does what the title asks, I would prefer a more algorithmic solution, or one that includes better compression.
Short Answer: Use fastutil and maybe increase spark.shuffle.memoryFraction.
More details:
The problem with this RDD is that Java needs to store object references, which consume much more space than primitive types. In this example, I need to store Integers instead of int values. A Java Integer takes 16 bytes, while a primitive Java int takes 4 bytes. Scala's Int type, on the other hand, is a 32-bit (4-byte) type, just like Java's int, which is why people using Scala may not have faced something similar.
Apart from increasing spark.shuffle.memoryFraction to 0.4, another nice solution was to use the fastutil library, as suggested in Spark's tuning documentation:
The first way to reduce memory consumption is to avoid the Java features that add overhead, such as pointer-based data structures and wrapper objects. There are several ways to do this: Design your data structures to prefer arrays of objects, and primitive types, instead of the standard Java or Scala collection classes (e.g. HashMap). The fastutil library provides convenient collection classes for primitive types that are compatible with the Java standard library.
This enables storing each element of the int arrays in my RDD pairs as an int (i.e., using 4 bytes instead of 16 for each element of the array). In my case, I used IntArrayList instead of Integer[]. This made the shuffle size drop significantly and allowed my program to run on the cluster. I also used this library in other parts of the code, where I was building some temporary Map structures. Overall, by increasing spark.shuffle.memoryFraction to 0.4 and using the fastutil library, the shuffle size dropped from 96GB to 50GB (!) with the default Java serializer (not Kryo).
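As a rough sketch of that conversion (not the exact code from my project; 'pairs' and the method name are illustrative):

import it.unimi.dsi.fastutil.ints.IntArrayList;
import org.apache.spark.api.java.JavaPairRDD;

// Sketch: convert the boxed Integer[] values of the pair RDD into fastutil's
// IntArrayList, so that the elements are stored as primitive ints.
public static JavaPairRDD<Integer, IntArrayList> toPrimitiveValues(
        JavaPairRDD<Integer, Integer[]> pairs) {
    return pairs.mapValues(arr -> {
        IntArrayList list = new IntArrayList(arr.length);
        for (Integer v : arr) {
            list.add(v.intValue());   // unbox once, store as a 4-byte int
        }
        return list;
    });
}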
Alternative: I have also tried sorting each int array of an RDD pair and storing the deltas using Hadoop's VIntArrayWritable type (smaller numbers use less space than bigger numbers), but this also required registering VIntWritable and VIntArrayWritable in Kryo, which didn't save any space after all. In general, I think that Kryo only makes things faster, but does not decrease the space needed, though I am still not sure about that.
I am not marking this answer as accepted yet, because someone else might have a better idea, and because I didn't use Kryo after all, as my OP was asking. I hope reading it will help someone else with the same issue. I will update this answer if I manage to further reduce the shuffle size.
I am still not really sure what you want to do. However, the fact that you use groupByKey and say that there is no way to do it with reduceByKey makes me more confused.
I think you have rdd = (Integer, Integer[]) and you want something like (Integer, Iterable[Integer[]]), which is why you are using groupByKey.
Anyway, I am not really familiar with Java in Spark, but in Scala I would use reduceByKey to avoid the shuffle with
rdd.mapValues(Iterable(_)).reduceByKey(_ ++ _). Basically, you want to convert each value to a single-element collection of arrays and then combine the collections per key.
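Since the question uses the Java API, here is a rough Java sketch of the same idea (not from the original answer; 'pairs' and the method name are illustrative). Note that reduceByKey still shuffles, but it combines values map-side first:

import java.util.ArrayList;
import java.util.List;
import org.apache.spark.api.java.JavaPairRDD;

// Wrap each value in a single-element list, then merge the lists per key.
public static JavaPairRDD<Integer, List<Integer[]>> groupViaReduce(
        JavaPairRDD<Integer, Integer[]> pairs) {
    return pairs
        .mapValues(v -> {
            List<Integer[]> list = new ArrayList<>();
            list.add(v);              // one array per input value
            return list;
        })
        .reduceByKey((a, b) -> { a.addAll(b); return a; });   // merge the lists
}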
I think the best approach that can be recommended here (without more specific knowledge of the input data) in general is to use the persist API on your input RDD.
As step one, I'd try to call .persist(MEMORY_ONLY_SER) on the input RDD to lower memory usage (albeit at a certain CPU overhead, which shouldn't be much of a problem for ints in your case).
If that is not sufficient, you can try .persist(MEMORY_AND_DISK_SER), or, if your shuffle still takes so much memory that the input dataset needs to be made easier on memory, .persist(DISK_ONLY) may be an option, but one that will strongly degrade performance.
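For reference, a minimal sketch of those persist calls using Spark's Java API ('input' is an illustrative name for the input RDD):

import org.apache.spark.storage.StorageLevel;

// Keep the input RDD serialized in memory; fall back to the commented-out
// levels if memory is still too tight.
input.persist(StorageLevel.MEMORY_ONLY_SER());
// input.persist(StorageLevel.MEMORY_AND_DISK_SER());
// input.persist(StorageLevel.DISK_ONLY());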
I was under the impression that StringBuffer is the fastest way to concatenate strings, but I saw this Stack Overflow post saying that concat() is the fastest method. I tried the 2 given examples in Java 1.5, 1.6 and 1.7 but I never got the results they did. My results are almost identical to this
Can somebody explain what I don't understand here? What is truly the fastest way to concatenate strings in Java?
Is there a different answer when one seeks the fastest way to concatenate two strings and when concatenating multiple strings?
String.concat is faster than the + operator if you are concatenating two strings... although this could be changed at any time and may even have been fixed in Java 8, as far as I know.
The thing you missed in the first post you referenced is that the author is concatenating exactly two strings, and the fast methods are the ones where the size of the new character array is calculated in advance as str1.length() + str2.length(), so the underlying character array only needs to be allocated once.
Using StringBuilder() without specifying the final size, which is also how + works internally, will often need to do more allocations and copying of the underlying array.
If you need to concatenate a bunch of strings together, then you should use a StringBuilder. If it's practical, then precompute the final size so that the underlying array only needs to be allocated once.
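As a minimal illustration of that last point (names are illustrative):

// Size the StringBuilder up front so its internal char array is allocated once.
public static String joinAll(String[] parts) {
    int total = 0;
    for (String s : parts) {
        total += s.length();          // first pass: compute the final length
    }
    StringBuilder sb = new StringBuilder(total);
    for (String s : parts) {
        sb.append(s);                 // no intermediate reallocation needed
    }
    return sb.toString();
}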
What I understood from the other answers is the following:
If you need thread safety, use StringBuffer.
If you do not need thread safety:
If the strings are known beforehand and the same code needs to be run multiple times, use '+', as the compiler will optimize it and handle the concatenation at compile time itself.
If only two strings need to be concatenated, use concat(), as it does not require a StringBuilder/StringBuffer object to be created. Credits to @nickb.
If multiple strings need to be concatenated, use StringBuilder.
Joining a very long list of strings by naively appending them from start to end is very slow: the backing buffer grows incrementally and is reallocated again and again, making additional copies (and putting a lot of pressure on the garbage collector).
The most efficient way to join a long list is to always start by joining the pair of adjacent strings whose total length is the smallest among all candidate pairs (this is essentially the optimal merge problem); however, that would require a complex lookup to find the optimal pair, and finding it only to reduce the number of copies to the strict minimum would slow things down.
What you need is a "divide and conquer" recursive algorithm with a good heuristic that comes very close to this optimum (a Java sketch follows the steps below):
If you have no string to join, return the empty string.
If you have only 1 string to join, just return it.
Otherwise if you have only 2 strings to join, join them and return the result.
Compute the total length of the final result.
Then determine the number of strings to join from the left until their combined length reaches half of this total: that gives the "divide" point splitting the set of strings into two non-empty parts (each part must contain at least 1 string; the division point cannot be the first or last string of the set to join).
Join the smallest part if it has at least 2 strings to join, otherwise join the other part (using this algorithm recursively).
Loop back to the beginning (1.) to complete the other joins.
Note that empty strings in the collection have to be ignored as if they were not part of the set.
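Here is a simplified Java sketch of this recursive strategy; it splits near half of the total length rather than implementing every refinement described above (empty strings simply contribute zero length here):

import java.util.List;

// Divide-and-conquer join: split the range roughly at half of its total
// length and join each half recursively.
public static String joinBalanced(List<String> parts) {
    return join(parts, 0, parts.size());
}

private static String join(List<String> parts, int from, int to) {
    int count = to - from;
    if (count <= 0) return "";
    if (count == 1) return parts.get(from);
    if (count == 2) return parts.get(from).concat(parts.get(from + 1));

    long total = 0;
    for (int i = from; i < to; i++) total += parts.get(i).length();

    // Walk from the left until we pass half of the total length; both parts
    // stay non-empty.
    int split = to - 1;
    long acc = 0;
    for (int i = from; i < to - 1; i++) {
        acc += parts.get(i).length();
        if (acc >= total / 2) { split = i + 1; break; }
    }
    return join(parts, from, split).concat(join(parts, split, to));
}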
Many default implementations of String.join(list of strings, optional separator) found in various libraries are slow, as they use naive incremental joining from left to right; the divide-and-conquer algorithm above will outperform them when you need to join MANY small strings to generate a very large string.
Such a situation is not exceptional: it occurs in text preprocessors and generators, or in HTML processing (e.g. in "Element.getInnerText()" when the element is a large document containing many text elements separated or contained by many named elements).
The strategy above works when the source strings are all (or almost all) to be garbage collected, keeping only the final result. If the result is kept alive as long as the list of source strings, the best alternative is to allocate the final large buffer for the result only once, at its total length, then copy the source strings into it from left to right.
In both cases, this requires a first pass on all strings to compute their total length.
If you use a reallocatable "string buffer", this does not work well when the buffer reallocates constantly. However, a string buffer may be useful during the first pass to pre-join short strings that fit into it, using a reasonable (medium) size (e.g. 4KB, one page of memory): once it is full, replace that subset of source strings with the content of the buffer, and allocate a new one.
This can considerably reduce the number of small strings in the source set, and after the first pass you know the total length of the final buffer to allocate for the result, into which you copy all the remaining medium-size strings collected in the first pass. This works very well when the list of source strings comes from a parser or generator, where the total length is not known until the end of parsing/generation: you only use intermediate string buffers of medium size, and finally you fill the result buffer without reparsing the input (to regenerate the many incremental fragments) and without calling the generator repeatedly (which would be slow, or impossible for some generators, or when the parser's input is consumed and cannot be replayed from the start).
Note that these remarks apply not just to joining strings, but also to file I/O: writing a file incrementally suffers from the same reallocation and fragmentation problems, so you should try to precompute the total final length of the generated file. Otherwise you need a classic buffer (implemented in most file I/O libraries, usually sized at about one memory page of 4KB), but you should allocate more, because file I/O is considerably slower and fragmentation becomes a performance problem for later file accesses when fragments are allocated incrementally in units of just one cluster. Using a buffer of about 1MB avoids most performance problems caused by fragmented allocation on the file system, as fragments will be considerably larger. A filesystem like NTFS is optimized to support fragments up to 64MB, above which fragmentation is no longer a noticeable problem; the same is true of Unix/Linux filesystems, which tend to defragment only up to a maximum fragment size and can efficiently handle allocation of small fragments using "pools" of free clusters organized by minimum sizes of 1, 2, 4, 8... clusters (powers of two), so that defragmenting these pools is straightforward, not very costly, and can be done asynchronously in the background when there is a low level of I/O activity.
And in all modern OSes, memory management is correlated with disk storage management through memory-mapped files used for caches: memory is backed by storage managed by the virtual memory manager (which means you can allocate more dynamic memory than you have physical RAM; the rest is paged out to disk when needed). The strategy you use for managing RAM for very large buffers therefore correlates with paging I/O performance: using a memory-mapped file is a good solution, and everything that worked with file I/O can then be done in a very large (virtual) memory space.
In addition to this quite old post, I need something that will use primitives and give a speedup for an application that contains lots of HashSets of Integers:
Set<Integer> set = new HashSet<Integer>();
So people mention libraries like Guava, Javolution and Trove, but there is no thorough comparison of these in terms of benchmarks and performance results, or at least a good answer coming from solid experience. From what I see, many recommend Trove's TIntHashSet, but others say it is not that good; some say Guava is super cool and maintainable, but I do not need beauty and maintainability, only execution time, so Python-style Guava goes home :) Javolution? I've visited the website, and it seems too old for me and thus wacky.
The library should provide the best achievable time, memory does not matter.
Looking at "Thinking in Java", there is an idea of creating custom HashMap with int[] as keys. So I would like to see something similar with a HashSet or simply download and use an amazing library.
EDIT (in response to the comments below)
So in my project I start with about 50 HashSet<Integer> collections, then I call a function about 1000 times that internally creates up to 10 HashSet<Integer> collections each time. If I change the initial parameters, these numbers may grow exponentially. I only use the add(), contains() and clear() methods on those collections, which is why they were chosen.
Now I'm looking for a library that implements HashSet or something similar, but does it faster by avoiding the Integer autoboxing overhead and maybe something else I'm not aware of. In fact, my data comes in as ints, and I store them in those HashSets.
Trove is an excellent choice.
The reason why it is much faster than generic collections is memory use.
A java.util.HashSet<Integer> uses a java.util.HashMap<Integer, Integer> internally. In a HashMap, each object is wrapped in an Entry<Integer, Integer>. These objects take an estimated 24 bytes for the Entry + 16 bytes for the actual Integer + 4 bytes in the actual hash table. That yields 44 bytes, as opposed to 4 bytes in Trove: up to an 11x memory overhead (note that unoccupied entries in the main table will yield a smaller difference in practice).
See also these experiments:
http://www.takipiblog.com/2014/01/23/java-scala-guava-and-trove-collections-how-much-can-they-hold/
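For completeness, a minimal usage sketch of Trove's primitive set (assuming the Trove 3.x package layout), matching the add()/contains()/clear() usage from the question:

import gnu.trove.set.hash.TIntHashSet;

// Primitive int set: no Integer boxing, no per-entry wrapper objects.
TIntHashSet set = new TIntHashSet();
set.add(42);
boolean present = set.contains(42);
set.clear();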
Take a look at the High Performance Primitive Collections for Java (HPPC). It is an alternative to Trove, mature and carefully designed for efficiency. See the JavaDoc for IntOpenHashSet.
Have you tried working with the initial capacity and load factor parameters while creating your HashSet?
HashSet doc
Initial capacity, as you might think, refers to how big the empty hash set will be when created, and the load factor is a threshold that determines when to grow the hash table. Normally you would like to keep the ratio between used buckets and total buckets below two thirds, which is regarded as the best ratio for good, stable performance in a hash table.
Dynamic resizing of a hash table
So basically, try to set an initial capacity that fits your needs (to avoid re-creating and reassigning the values of the hash table when it grows), and fiddle with the load factor until you find a sweet spot.
It might be that for your particular data distribution and pattern of setting/getting values, a lower load factor could help (hardly a higher one, but your mileage may vary).
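As a minimal illustration (the numbers are placeholders, not recommendations):

import java.util.HashSet;
import java.util.Set;

// Size the set up front for an expected number of elements and a chosen load
// factor, so the backing table is not rebuilt while the set grows.
int expectedSize = 1_000_000;
float loadFactor = 0.66f;
int initialCapacity = (int) (expectedSize / loadFactor) + 1;
Set<Integer> set = new HashSet<Integer>(initialCapacity, loadFactor);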
I have an in-memory collection that I want to flush to disk once it has reached either a certain size (count wise) or memory footprint.
How can I determine how much memory a collection is using?
It is going to be some sort of Dictionary/Map.
You can't, easily. For example, consider an ArrayList<String> with a backing array of size 256 and an "in use" size of 200, where each string is 20 characters long, backed by a 30 character backing array.
It sounds like you could easily work out how much memory that's taking - but if every element in the array is actually a reference to the same string, then obviously it takes a lot less memory. That's just for String, which is a class which is relatively straightforward to analyze. For classes with various mixtures of definitely-distinct and possibly-shared references, it becomes even more complicated.
You could serialize it - but that only shows you how much space it takes up when serialized, not in memory.
I suggest you experiment and find some appropriate "average" size, derive a maximum count that makes sense, and just go on that basis.
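As a rough sketch of that approach (the per-entry byte estimate and the thresholds are assumptions you would tune from your own measurements, and the class and method names are hypothetical):

import java.util.HashMap;
import java.util.Map;

// Flush when either an entry-count limit or an estimated memory budget is hit.
public final class FlushingCache {
    private static final long ESTIMATED_BYTES_PER_ENTRY = 64;         // assumed average, measure for your data
    private static final long MAX_ESTIMATED_BYTES = 16 * 1024 * 1024; // 16MB budget (illustrative)
    private static final int MAX_COUNT = 100_000;

    private final Map<String, String> buffer = new HashMap<>();

    public void put(String key, String value) {
        buffer.put(key, value);
        long estimatedBytes = (long) buffer.size() * ESTIMATED_BYTES_PER_ENTRY;
        if (buffer.size() >= MAX_COUNT || estimatedBytes >= MAX_ESTIMATED_BYTES) {
            flushToDisk();
        }
    }

    private void flushToDisk() {
        // Write 'buffer' out in whatever format the application needs, then clear it.
        buffer.clear();
    }
}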
I've seen many primitive examples describing how String intern()'ing works, but I have yet to see a real-life use-case that would benefit from it.
The only situation that I can dream up is having a web service that receives a considerable amount of requests, each being very similar in nature due to a rigid schema. By intern()'ing the request field names in this case, memory consumption can be significantly reduced.
Can anyone provide an example of using intern() in a production environment with great success? Maybe an example of it in a popular open source offering?
Edit: I am referring to manual interning, not the guaranteed interning of String literals, etc.
Interning can be very beneficial if you have N strings that can take only K different values, where N far exceeds K. Now, instead of storing N strings in memory, you will only be storing up to K.
For example, you may have an ID type which consists of 5 digits, so there can only be 10^5 different values. Suppose you're now parsing a large document that has many references/cross-references to ID values. Let's say this document has 10^9 references in total (obviously some references are repeated in other parts of the document).
So N = 10^9 and K = 10^5 in this case. If you are not interning the strings, you will be storing 10^9 strings in memory, where lots of those strings are equal (by the pigeonhole principle). If you intern() the ID string you get while parsing the document, and you don't keep any reference to the uninterned strings you read from the document (so they can be garbage collected), then you will never need to store more than 10^5 strings in memory.
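A minimal sketch of that parsing pattern (readNextId() is a hypothetical stand-in for the real parser):

import java.util.ArrayList;
import java.util.List;

// Intern each ID as it is read, so repeated IDs share one String instance and
// at most ~10^5 distinct strings stay retained.
List<String> references = new ArrayList<>();
String id;
while ((id = readNextId()) != null) {   // readNextId() is hypothetical
    references.add(id.intern());
}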
We had a production system that processes literally millions of pieces of data at a time, many of which have string fields. We should have been interning strings, but there was a bug which meant we were not. By fixing the bug we avoided having to do a very costly (at least 6 figures, possibly 7) server upgrade.
Examples where interning will be beneficial involve a large number of strings where:
the strings are likely to survive multiple GC cycles, and
there are likely to be multiple copies of a large percentage of the Strings.
Typical examples involve splitting / parsing a text into symbols (words, identifiers, URIs) and then attaching those symbols to long-lived data structures. XML processing, programming language compilation and RDF / OWL triple stores spring to mind as applications where interning is likely to be beneficial.
But interning is not without its problems, especially if it turns out that the assumptions above are not correct:
the pool data structure used to hold the interned strings takes extra space,
interning takes time, and
interning doesn't prevent the creation of the duplicate string in the first place.
Finally, interning potentially increases GC overheads by increasing the number of objects that need to be traced and copied, and by increasing the number of weak references that need to be dealt with. This increase in overheads has to be balanced against the decrease in GC overheads that results from effective interning.
Not a complete answer but additional food for thought (found here):
Therefore, the primary benefit in this case is that using the == operator for internalized strings is a lot faster than using the equals() method [for non-internalized Strings]. So, use the intern() method if you're going to be comparing strings more than a time or three.
Never, ever, use intern on user-supplied data, as that can cause denial of service attacks (as intern()ed strings are never freed). You can do validation on the user-supplied strings, but then again you've done most of the work needed for intern().