I was under the impression that StringBuffer is the fastest way to concatenate strings, but I saw this Stack Overflow post saying that concat() is the fastest method. I tried the 2 given examples in Java 1.5, 1.6 and 1.7 but I never got the results they did. My results are almost identical to this
Can somebody explain what I don't understand here? What is truly the fastest way to concatenate strings in Java?
Is there a different answer when one seeks the fastest way to concatenate two strings and when concatenating multiple strings?
String.concat is faster than the + operator when you are concatenating exactly two strings... although this could change in any release, and may already have changed in Java 8, as far as I know.
The thing you missed in the first post you referenced is that the author is concatenating exactly two strings, and the fast methods are the ones where the size of the new character array is calculated in advance as str1.length() + str2.length(), so the underlying character array only needs to be allocated once.
Using StringBuilder() without specifying the final size, which is also how + works internally, will often need to do more allocations and copying of the underlying array.
If you need to concatenate a bunch of strings together, then you should use a StringBuilder. If it's practical, then precompute the final size so that the underlying array only needs to be allocated once.
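To make the distinction concrete, here is a small sketch (class and method names are mine): concat() sizes the result array once for the two-string case, and presizing a StringBuilder achieves the same single allocation for the many-string case:

```java
public class JoinTwo {
    // Joins exactly two strings; concat() allocates the result
    // array once, sized as a.length() + b.length().
    public static String viaConcat(String a, String b) {
        return a.concat(b);
    }

    // For many parts, presizing the StringBuilder avoids repeated
    // reallocation and copying of the underlying array.
    public static String viaPresizedBuilder(String... parts) {
        int total = 0;
        for (String p : parts) total += p.length();
        StringBuilder sb = new StringBuilder(total); // single allocation
        for (String p : parts) sb.append(p);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(viaConcat("foo", "bar"));
        System.out.println(viaPresizedBuilder("a", "b", "c"));
    }
}
```

An unsized `new StringBuilder()` starts with a small default capacity and must grow as you append, which is exactly the extra copying described above.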
What I understood from the other answers is the following:
If you need thread safety, use StringBuffer
If you do not need thread safety:
If the strings are known beforehand and the same code needs to run multiple times, use '+', as the compiler will optimize it and handle the concatenation at compile time.
If only two strings need to be concatenated, use concat(), as it does not require a StringBuilder/StringBuffer object to be created. Credit to #nickb.
If multiple strings need to be concatenated, use StringBuilder.
Joining very long lists of strings by naively appending them from start to end is very slow: the buffer grows incrementally and is reallocated again and again, making extra copies (and putting a lot of pressure on the garbage collector).
The most efficient way to join a long list is to always start by joining the pair of adjacent strings whose total length is the smallest among ALL candidate pairs; however, this would require a complex lookup to find the optimal pair (similar to building an optimal merge tree, as in Huffman coding), and finding it just to reduce the number of copies to the strict minimum would itself slow things down.
What you need is a smart "divide and conquer" recursive algorithm with a good heuristic that comes very close to this optimum:
If you have no string to join, return the empty string.
If you have only 1 string to join, just return it.
Otherwise if you have only 2 strings to join, join them and return the result.
Compute the total length of the final result.
Then determine the number of strings to join from the left until their cumulative length reaches half of this total; this gives the "divide" point splitting the set of strings into two non-empty parts (each part must contain at least one string, so the division point cannot fall before the first or after the last string of the set).
Join the smallest part if it has at least 2 strings to join, otherwise join the other part (using this algorithm recursively).
Loop back to the beginning (1.) to complete the other joins.
Note that empty strings in the collection have to be ignored as if they were not part of the set.
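A minimal sketch of the recursion above (the class name is mine; for brevity it skips the empty-string filtering from the note, and simply recurses into both halves rather than joining the smaller half first):

```java
import java.util.List;

public class BalancedJoin {
    // Divide-and-conquer join: split the list where the cumulative
    // length reaches about half of the total, join each half
    // recursively, then concatenate the two results.
    public static String join(List<String> parts) {
        if (parts.isEmpty()) return "";
        if (parts.size() == 1) return parts.get(0);
        if (parts.size() == 2) return parts.get(0).concat(parts.get(1));

        long total = 0;
        for (String s : parts) total += s.length();

        long acc = 0;
        int split = 1; // each side must keep at least one string
        for (int i = 0; i < parts.size() - 1; i++) {
            acc += parts.get(i).length();
            if (acc >= total / 2) { split = i + 1; break; }
        }
        return join(parts.subList(0, split))
                .concat(join(parts.subList(split, parts.size())));
    }

    public static void main(String[] args) {
        System.out.println(join(java.util.Arrays.asList("foo", "bar", "baz", "qux")));
    }
}
```

Because each concat() at a leaf allocates its result exactly once, the total amount of copying stays close to the optimum described above.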
Many default implementations of String.join(list of strings, optional separator) found in various libraries are slow because they use naive incremental joining from left to right; the divide-and-conquer algorithm above will outperform them when you need to join MANY small strings into one very large string.
Such situations are not exceptional; they occur in text preprocessors and generators, and in HTML processing (e.g. in "Element.getInnerText()" when the element is a large document containing many text nodes separated or contained by many named elements).
The strategy above works when the source strings are all (or almost all) to be garbage collected, keeping only the final result. If the result is kept alive as long as the list of source strings, the best alternative is to allocate the final large buffer for the result only once, at its total length, and then copy the source strings into it from left to right.
In both cases, this requires a first pass over all the strings to compute their total length.
If you use a reallocatable "string buffer", this does not work well when the buffer reallocates constantly. However, a string buffer of reasonable (medium) size (e.g. 4KB, one page of memory) can be useful during the first pass to pre-join short strings that fit in it: once it is full, replace that subset of strings with the buffer's content and allocate a new buffer.
This can considerably reduce the number of small strings in the source set, and after the first pass you also have the total length of the final buffer to allocate for the result, into which you incrementally copy all the remaining medium-size strings collected during the first pass. This works very well when the list of source strings comes from a parser function or generator, where the total length is not fully known until the end of parsing/generation: you use only intermediate string buffers of medium size, and finally you generate the final buffer without reparsing the input (to get the many incremental fragments again) and without calling the generator repeatedly (which would be slow, or would not work for some generators, or for a parser whose input is consumed and cannot be replayed from the start).
Note that these remarks apply not only to joining strings but also to file I/O: writing a file incrementally also suffers from reallocation and fragmentation, so you should try to precompute the total final length of the generated file. Otherwise you need a classic buffer (implemented in most file I/O libraries, usually sized at about one 4KB memory page), but you should allocate more, because file I/O is considerably slower and fragmentation becomes a performance problem for later file accesses when fragments are allocated incrementally in units of a single cluster. Using a buffer of about 1MB avoids most performance problems caused by fragmented allocation on the file system, as the fragments will be considerably larger. A filesystem like NTFS is optimized to support fragments up to 64MB, above which fragmentation is no longer a noticeable problem; the same is true of Unix/Linux filesystems, which tend to defragment only up to a maximum fragment size and can efficiently allocate small fragments using pools of free clusters organized by size (1 cluster, 2 clusters, 4 clusters, 8 clusters... in powers of two), so that defragmenting these pools is straightforward, not very costly, and can be done asynchronously in the background during periods of low I/O activity.
And in all modern OSes, memory management is correlated with disk storage management through memory-mapped files used for caches: memory is backed by storage and managed by the virtual memory manager (which means you can allocate more dynamic memory than you have physical RAM; the rest is paged out to disk when needed). The strategy you use for managing very large buffers in RAM therefore tends to be correlated with paging I/O performance: a memory-mapped file is a good solution, and everything that worked with file I/O can then be done in a very large (virtual) memory.
I need to store a large dictionary of natural language words -- up to 120,000, depending on the language. These need to be kept in memory as profiling has shown that the algorithm which utilises the array is the time bottleneck in the system. (It's essentially a spellchecking/autocorrect algorithm, though the details don't matter.) On Android devices with 16MB memory, the memory overhead associated with Java Strings is causing us to run out of space. Note that each String has a 38 byte overhead associated with it, which gives up to a 5MB overhead.
At first sight, one option is to substitute char[] for String. (Or even byte[], as UTF-8 is more compact in this case.) But again, the memory overhead is an issue: each Java array has a 32 byte overhead.
One alternative to ArrayList<String>, etc. is to create a class with much the same interface that internally concatenates all the strings into one gigantic string, e.g. represented as a single byte[], and then store offsets into that huge string. Each offset would take up 4 bytes, giving a much more space-efficient solution.
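One way the offset idea might look (the class name is mine; the temporary byte[][] during construction is a shortcut, and real code might decode lazily or compare against the raw bytes directly):

```java
import java.nio.charset.StandardCharsets;
import java.util.List;

public class PackedStringPool {
    // All words are concatenated into one byte[] (UTF-8); offsets[i]
    // marks where word i starts, with a sentinel entry at the end.
    private final byte[] data;
    private final int[] offsets;

    public PackedStringPool(List<String> words) {
        byte[][] encoded = new byte[words.size()][];
        int total = 0;
        for (int i = 0; i < words.size(); i++) {
            encoded[i] = words.get(i).getBytes(StandardCharsets.UTF_8);
            total += encoded[i].length;
        }
        data = new byte[total];
        offsets = new int[words.size() + 1];
        int pos = 0;
        for (int i = 0; i < encoded.length; i++) {
            offsets[i] = pos;
            System.arraycopy(encoded[i], 0, data, pos, encoded[i].length);
            pos += encoded[i].length;
        }
        offsets[words.size()] = total;
    }

    public String get(int i) {
        return new String(data, offsets[i], offsets[i + 1] - offsets[i],
                StandardCharsets.UTF_8);
    }

    public int size() { return offsets.length - 1; }
}
```

Here the per-word cost is just the UTF-8 bytes plus one 4-byte offset, instead of a full String object with its header, fields, and backing array.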
My questions are a) are there any other solutions to the problem with similarly low overheads* and b) is any solution available off-the-shelf? Searching through the Guava, trove and PCJ collection libraries yields nothing.
*I know one can get the overhead down below 4 bytes, but there are diminishing returns.
NB. Support for Compressed Strings being Dropped in HotSpot JVM? suggests that the JVM option -XX:+UseCompressedStrings isn't going to help here.
I had to develop a word dictionary for a class project. We ended up using a trie as the data structure. I'm not sure of the size difference between an ArrayList and a trie, but the performance is a lot better.
Here are some resources that could be helpful.
https://en.wikipedia.org/wiki/Trie
https://www.topcoder.com/community/data-science/data-science-tutorials/using-tries/
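For reference, the basic structure of a trie looks like this (a minimal map-based sketch; note that a HashMap node per character actually costs more memory than the packed representations discussed above, so memory-conscious implementations use arrays instead):

```java
import java.util.HashMap;
import java.util.Map;

public class Trie {
    // One node per character; words sharing a prefix share nodes.
    private final Map<Character, Trie> children = new HashMap<>();
    private boolean isWord;

    public void insert(String word) {
        Trie node = this;
        for (char c : word.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Trie());
        }
        node.isWord = true;
    }

    public boolean contains(String word) {
        Trie node = this;
        for (char c : word.toCharArray()) {
            node = node.children.get(c);
            if (node == null) return false;
        }
        return node.isWord;
    }
}
```

Lookup cost is proportional to the word's length, independent of how many words are stored, which is where the performance win comes from.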
Which is the more expensive operation on an integer array in Java: a swap or a comparison? Or can they all be treated as having the same cost?
Context: sorting an almost-sorted array (I am not talking about a k-sorted array where each element is displaced from its correct position by at most k). Even if we use insertion sort, the number of comparisons by the end will be the same as for any other array, or as in the worst case, won't it? It is just that the swaps will be fewer. Please correct me if I am wrong.
Swap should be more expensive because it includes:
Reading data from memory to cache
Reading data from cache to registers
Writing data back to cache
Comparison should be less expensive because it includes:
Reading data from memory to cache
Reading data from cache to registers
Executing single compare operations on two registers (which should be a little faster than writing two integers into a cache)
But modern processors are complex and different from each other, so the best way to get the right answer is to benchmark your code.
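As a rough illustration only (a harness such as JMH is the right tool for real measurements, since the JIT can easily distort naive timings like these), the two operations can be compared with a sketch along these lines; class and method names are mine:

```java
public class SwapVsCompare {
    // Compare only: reads two elements, no writes.
    public static long compareLoop(int[] a) {
        long hits = 0;
        for (int i = 0; i + 1 < a.length; i++) {
            if (a[i] > a[i + 1]) hits++;
        }
        return hits;
    }

    // Swap: the same reads, plus two writes back to the array.
    public static void swapLoop(int[] a) {
        for (int i = 0; i + 1 < a.length; i++) {
            int t = a[i];
            a[i] = a[i + 1];
            a[i + 1] = t;
        }
    }

    public static void main(String[] args) {
        int[] data = new int[1_000_000];
        for (int i = 0; i < data.length; i++) data[i] = data.length - i;
        long t0 = System.nanoTime();
        long hits = compareLoop(data);
        long t1 = System.nanoTime();
        swapLoop(data);
        long t2 = System.nanoTime();
        System.out.println("compare: " + (t1 - t0) + " ns, swap: "
                + (t2 - t1) + " ns, hits=" + hits);
    }
}
```

Note that the swap loop rotates the array as a side effect; the point is only that each swap performs two writes in addition to its reads, which is the extra cost discussed above.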
I am working on a project where I am processing a lot of tweets; the goal is to remove duplicates as I process them. I have the tweet IDs, which come in as strings of the format "166471306949304320"
I have been using a HashSet<String> for this, which works fine for a while. But by the time I get to around 10 million items I am drastically bogged down and eventually get a GC error, presumably from the rehashing. I tried defining a better size/load with
tweetids = new HashSet<String>(220000,0.80F);
and that lets it get a little farther, but is still excruciatingly slow (by around 10 million it is taking 3x as long to process). How can I optimize this? Given that I have an approximate idea of how many items should be in the set by the end (in this case, around 20-22 million) should I create a HashSet that rehashes only two or three times, or would the overhead for such a set incur too many time-penalties? Would things work better if I wasn't using a String, or if I define a different HashCode function (which, in this case of a particular instance of a String, I'm not sure how to do)? This portion of the implementation code is below.
tweetids = new HashSet<String>(220000,0.80F); // in constructor
duplicates = 0;
...
// In loop: For(each tweet)
String twid = (String) tweet_twitter_data.get("id");
// Check that we have not processed this tweet already
if (!(tweetids.add(twid))){
duplicates++;
continue;
}
SOLUTION
Thanks to your recommendations, I solved it. The problem was the amount of memory required for the hash representations; first, HashSet<String> was simply enormous and uncalled for because the String.hashCode() is exorbitant for this scale. Next I tried a Trie, but it crashed at just over 1 million entries; reallocating the arrays was problematic. I used a HashSet<Long> to better effect and almost made it, but speed decayed and it finally crashed on the last leg of the processing (around 19 million). The solution came with departing from the standard library and using Trove. It finished 22 million records a few minutes faster than not checking duplicates at all. Final implementation was simple, and looked like this:
import gnu.trove.set.hash.TLongHashSet;
...
TLongHashSet tweetids; // class variable
...
tweetids = new TLongHashSet(23000000,0.80F); // in constructor
...
// inside for(each record)
String twid = (String) tweet_twitter_data.get("id");
if (!(tweetids.add(Long.parseLong(twid)))) {
duplicates++;
continue;
}
You may want to look beyond the Java collections framework. I've done some memory-intensive processing, and you will face several problems:
The number of buckets for large hash maps and hash sets is going to cause a lot of overhead (memory). You can influence this by using some kind of custom hash function and a modulo of e.g. 50000.
Strings are represented using 16 bit characters in Java. You can halve that by using utf-8 encoded byte arrays for most scripts.
HashMaps are in general quite wasteful data structures and HashSets are basically just a thin wrapper around those.
Given that, take a look at trove or guava for alternatives. Also, your ids look like longs. Those are 64 bit, quite a bit smaller than the string representation.
An alternative you might want to consider is using Bloom filters (Guava has a decent implementation). A Bloom filter tells you if something is definitely not in a set, and with reasonable certainty (less than 100%) whether something is contained. Combined with some disk-based solution (e.g. a database, MapDB, memcached, ...) that should work reasonably well. You could buffer up incoming new ids, write them in batches, and use the Bloom filter to check whether you need to look in the database at all, avoiding expensive lookups most of the time.
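To make the idea concrete without pulling in Guava, here is a minimal hand-rolled Bloom filter for long ids (the class name and the two mixing constants are my own choices; a production filter would size the bit array and probe count from the expected element count and target false-positive rate):

```java
import java.util.BitSet;

public class LongBloomFilter {
    private final BitSet bits;
    private final int size; // number of bits
    private final int k;    // probes per element

    public LongBloomFilter(int size, int k) {
        this.bits = new BitSet(size);
        this.size = size;
        this.k = k;
    }

    // Derive the i-th bit index from the id with simple mixing.
    private int probe(long id, int i) {
        long h = id * 0x9E3779B97F4A7C15L + i * 0xC2B2AE3D27D4EB4FL;
        h ^= h >>> 32;
        return (int) Math.floorMod(h, (long) size);
    }

    public void add(long id) {
        for (int i = 0; i < k; i++) bits.set(probe(id, i));
    }

    // false => definitely never added; true => probably added.
    public boolean mightContain(long id) {
        for (int i = 0; i < k; i++) {
            if (!bits.get(probe(id, i))) return false;
        }
        return true;
    }
}
```

A "false" answer is guaranteed correct, so only the (rare) "true" answers ever need the expensive database lookup.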
If you are just looking for the existence of Strings, then I would suggest you try using a Trie(also called a Prefix Tree). The total space used by a Trie should be less than a HashSet, and it's quicker for string lookups.
The main disadvantage is that it can be slower when used from a hard disk, since it loads a tree rather than a linearly stored structure like a hash table. So make sure that it can be held entirely in RAM.
The link I gave is a good list of pros/cons of this approach.
*as an aside, the bloom filters suggested by Jilles Van Gurp are great fast prefilters.
Simple, untried and possibly stupid suggestion: Create a Map of Sets, indexed by the first/last N characters of the tweet ID:
Map<String, Set<String>> sets = new HashMap<String, Set<String>>();
String tweetId = "166471306949304320";
sets.put(tweetId.substring(0, 5), new HashSet<String>());
sets.get(tweetId.substring(0, 5)).add(tweetId);
assert(sets.containsKey(tweetId.substring(0, 5)) && sets.get(tweetId.substring(0, 5)).contains(tweetId));
That easily lets you keep the maximum size of the hashing space(s) below a reasonable value.
The task is to count the occurrences of each word in an input file.
The input file has 8 characters per line, and there are 10M lines, for example:
aaaaaaaa
bbbbbbbb
aaaaaaaa
abcabcab
bbbbbbbb
...
the output is:
aaaaaaaa 2
abcabcab 1
bbbbbbbb 2
...
It'll take 80MB of memory if I load all of the words into memory, but there is only 60MB available on the system that I can use for this task. So how can I solve this problem?
My algorithm is to use a Map<String,Integer>, but the JVM throws Exception in thread "main" java.lang.OutOfMemoryError: Java heap space. I know I could solve this by setting -Xmx1024m, for example, but I want to use less memory to solve it.
I believe that the most robust solution is to use the disk space.
For example, you can sort your file into another file, using an algorithm for sorting large files (one that uses disk space), and then count the consecutive occurrences of the same word.
I believe this post can help you. Or search for material on external sorting yourself.
Update 1
Or, as @jordeu suggests, you can use an embedded Java database library: H2, JavaDB, or similar.
Update 2
I thought about another possible solution, using Prefix Tree. However I still prefer the first one, because I'm not an expert on them.
Read one line at a time, and keep e.g. a HashMap<String,Integer> where you put the words as keys and the counts as values. If a key exists, increase its count; otherwise add the key to the map with a count of 1.
There is no need to keep the whole file in memory.
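A sketch of this streaming approach (class and method names are mine; in a real program the Stream would come from Files.lines(...) so that only one line at a time is read from disk):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.stream.Stream;

public class WordCounter {
    // Consumes lines one at a time; only the map of distinct words
    // (not the whole file) is held in memory.
    public static Map<String, Integer> count(Stream<String> lines) {
        Map<String, Integer> counts = new HashMap<>();
        lines.forEach(line -> counts.merge(line, 1, Integer::sum));
        return counts;
    }

    public static void main(String[] args) {
        // In production this would be Files.lines(Path.of("input.txt")).
        Stream<String> lines = Stream.of("aaaaaaaa", "bbbbbbbb", "aaaaaaaa");
        System.out.println(count(lines));
    }
}
```

The caveat raised by the other answers still applies: with 10M distinct 8-char words, the map itself is what overflows the 60MB budget.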
I guess you mean the number of distinct words, don't you?
So the obvious approach is to store (distinctive information about) each different word as a key in a map, where the value is the associated counter. Depending on how many distinct words are expected, storing all of them may even fit into your memory; however, not in the worst-case scenario, when all words are different.
To lessen memory needs, you could calculate a checksum for the words and store that, instead of the words themselves. Storing e.g. a 4-byte checksum instead of an 8-character word (requiring at least 9 bytes to store) requires 40M instead of 90M. Plus you need a counter for each word too. Depending on the expected number of occurrences for a specific word, you may be able to get by with 2 bytes (for max 65535 occurrences), which requires max 60M of memory for 10M distinct words.
Update
Of course, the checksum can be calculated in many different ways, and it can be lossless or not. This also depends a lot on the character set used in the words. E.g. if only lowercase standard ASCII characters are used (as shown in the examples above), we have 26 different characters at each position. Consequently, each character can be losslessly encoded in 5 bits. Thus 8 characters fit into 5 bytes, which is a bit more than the limit, but may be dense enough, depending on the circumstances.
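The 5-bits-per-character encoding described above can be sketched like this (the class name is mine; it is lossless for words built only of the lowercase letters a-z, since 2^5 = 32 >= 26):

```java
public class PackedWord {
    // Packs a lowercase word into 5 bits per character
    // ('a' = 0 ... 'z' = 25); 8 chars use 40 bits of a long.
    public static long pack(String word) {
        long packed = 0;
        for (int i = 0; i < word.length(); i++) {
            packed = (packed << 5) | (word.charAt(i) - 'a');
        }
        return packed;
    }

    // Exact inverse of pack() for words of the given length.
    public static String unpack(long packed, int length) {
        char[] out = new char[length];
        for (int i = length - 1; i >= 0; i--) {
            out[i] = (char) ('a' + (packed & 31));
            packed >>>= 5;
        }
        return new String(out);
    }
}
```

Because the packed value fits in 40 bits, it doubles as a perfect (collision-free) "checksum" for lowercase 8-character words.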
I suck at explaining theoretical answers but here we go....
I have made some assumptions about your question, as it is not entirely clear.
The memory used to store all the distinct words is 80MB (the entire file is bigger).
The words could contain non-ascii characters (so we just treat the data as raw bytes).
It is sufficient to read over the file twice storing ~ 40MB of distinct words each time.
// Loop over the file and for each word:
//
// Compute a hash of the word.
// Convert the hash to a number by some means (skip if possible).
// If the number is odd then skip to the next word.
// Use conventional means to store the distinct word.
//
// Do something with all the distinct words.
Then repeat the above a second time using even instead of odd.
Then you have divided the task into 2 and can do each separately.
No words from the first set will appear in the second set.
The hash is necessary because the words could (in theory) all end with the same letter.
The solution can be extended to work with different memory constraints. Rather than saying just odd/even we can divide the words into X groups by using number MOD X.
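The odd/even (more generally, MOD X) partitioning above might be sketched like this (names are mine; each of the X passes keeps only about 1/X of the distinct words in memory):

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PartitionedCount {
    // Pass g only counts words whose hash falls into group g, so at
    // most ~1/x of the distinct words are held in memory at once.
    public static Map<String, Integer> countGroup(List<String> words,
                                                  int x, int g) {
        Map<String, Integer> counts = new HashMap<>();
        for (String w : words) {
            if (Math.floorMod(w.hashCode(), x) == g) {
                counts.merge(w, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> data = Arrays.asList("aaaaaaaa", "bbbbbbbb", "aaaaaaaa");
        for (int g = 0; g < 4; g++) {
            System.out.println("group " + g + ": " + countGroup(data, 4, g));
        }
    }
}
```

Because every occurrence of a given word hashes to the same group, each word is counted completely within exactly one pass, and writing each group's results out before the next pass yields the full tally.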
Use the H2 Database Engine; it can work on disk or in memory as needed, and it has really good performance.
I'd create a SHA-1 hash of each word, then store those values in a Set. When reading a word, checking membership first isn't strictly necessary, since a Set is by definition unique: you can simply add the word's SHA-1 value and see whether it was already present.
Depending on what kind of characters the words are built of, you can choose this scheme:
If a word might contain any letter of the alphabet in upper and lower case, you will have (26*2)^8 = 53459728531456 combinations. This number fits in a long datatype.
So compute the checksum for the strings like this:
public static long checksum(String str)
{
String tokens = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";
long checksum = 0;
for (int i = 0; i < str.length(); ++i)
{
int c = tokens.indexOf(str.charAt(i));
checksum *= tokens.length();
checksum += c;
}
return checksum;
}
This will reduce the memory taken per word by more than 8 bytes. A String is backed by an array of chars, and each char in Java is 2 bytes, so 8 chars = 16 bytes. But the String class contains more data than just the char array: it also holds some integers for size and offset, at 4 bytes per int. Don't forget the object references to the Strings and char arrays either. So a rough estimate makes me think this will save about 28 bytes per word.
So, at 8 bytes per word with 10,000,000 words, that gives 76 MB. Your first estimate was wrong because it left out all the things I noted, and it means that even this method won't work.
You can convert each 8-byte word into a long and use a TLongIntHashMap, which is quite a bit more efficient than a Map<String, Integer> or Map<Long, Integer>.
If you just need the distinct words, you can use a TLongHashSet.
If you can sort your file first (e.g. using the memory-efficient "sort" utility on Unix), then it's easy. You simply read the sorted items, counting the neighboring duplicates as you go, and write the totals to a new file immediately.
If you need to sort using Java, this post might help:
http://www.codeodor.com/index.cfm/2007/5/10/Sorting-really-BIG-files/1194
You can use constant memory by reading your file multiple times.
Basic idea:
Treat the file as n partitions p_1...p_n, sized so that you can load each of them into ram.
Load p_i into a Map structure, scan through the whole file, and keep track of the counts of the elements in p_i only (see the answer by Heiko Rupp)
Remove the element if the same value was already encountered in a partition p_j with j < i
Output result counts for elements in the Map
Clear Map, repeat for all p_1...p_n
As in any optimization, there are tradeoffs. In your case, you can do the same task with less memory but it comes at the cost of increasing runtime.
Your scarce resource is memory, so you can't store the words in RAM.
You could use a hash instead of the word as other posts mention, but if your file grows in size this is no solution, since at some point you'll run into the same problem again.
Yes, you could use an external server to crunch the file and do the job for your client app, but reading your question it seems that you want to do everything in one place (your app).
So my proposal is to iterate over the file, and for each word:
If the word is found for the first time, write the string to the result file together with the integer value 1.
If the word was processed before (it will appear in the result file), increment the record value.
This solution scales well regardless of the number of lines in your input file or the length of the words*.
You can optimize the way you do the writes in the output file, so that the search is made faster, but the basic version described above is enough to work.
EDIT:
*It scales well until you run out of disk space XD. So the precondition would be to have a disk with at least 2N bytes of free usable space, where N is the input file size in bytes.
Possible solutions:
Use file sorting and then just count the consecutive occurrences of each value.
Load the file in a database and use a count statement like this: select value, count(*) from table group by value
I've seen many primitive examples describing how String intern()'ing works, but I have yet to see a real-life use-case that would benefit from it.
The only situation that I can dream up is having a web service that receives a considerable amount of requests, each being very similar in nature due to a rigid schema. By intern()'ing the request field names in this case, memory consumption can be significantly reduced.
Can anyone provide an example of using intern() in a production environment with great success? Maybe an example of it in a popular open source offering?
Edit: I am referring to manual interning, not the guaranteed interning of String literals, etc.
Interning can be very beneficial if you have N strings that can take only K different values, where N far exceeds K. Now, instead of storing N strings in memory, you will only be storing up to K.
For example, you may have an ID type which consists of 5 digits. Thus, there can only be 10^5 different values. Suppose you're now parsing a large document that has many references/cross-references to ID values. Let's say this document has 10^9 references in total (obviously some references are repeated in other parts of the document).
So N = 10^9 and K = 10^5 in this case. If you are not interning the strings, you will be storing 10^9 strings in memory, where lots of those strings are equals (by Pigeonhole Principle). If you intern() the ID string you get when you're parsing the document, and you don't keep any reference to the uninterned strings you read from the document (so they can be garbage collected), then you will never need to store more than 10^5 strings in memory.
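A quick demonstration of the sharing effect (the ID value here is made up):

```java
public class InternDemo {
    public static void main(String[] args) {
        // Two distinct String objects with equal contents:
        String a = new String("ID-12345");
        String b = new String("ID-12345");
        System.out.println(a == b);   // false: two separate copies on the heap

        // intern() returns the one canonical copy from the pool, so
        // after interning (and dropping a and b) only one instance
        // remains reachable.
        String ia = a.intern();
        String ib = b.intern();
        System.out.println(ia == ib); // true: same object
    }
}
```

Scaled up to the parsing scenario above, interning each parsed ID and dropping the un-interned copy is what caps memory at K strings instead of N.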
We had a production system that processes literally millions of pieces of data at a time, many of which have string fields. We should have been interning strings, but there was a bug which meant we were not. By fixing the bug we avoided having to do a very costly (at least 6 figures, possibly 7) server upgrade.
Examples where interning will be beneficial involve large numbers of strings where:
the strings are likely to survive multiple GC cycles, and
there are likely to be multiple copies of a large percentage of the Strings.
Typical examples involve splitting / parsing a text into symbols (words, identifiers, URIs) and then attaching those symbols to long-lived data structures. XML processing, programming language compilation and RDF / OWL triple stores spring to mind as applications where interning is likely to be beneficial.
But interning is not without its problems, especially if it turns out that the assumptions above are not correct:
the pool data structure used to hold the interned strings takes extra space,
interning takes time, and
interning doesn't prevent the creation of the duplicate string in the first place.
Finally, interning potentially increases GC overheads by increasing the number of objects that need to be traced and copied, and by increasing the number of weak references that need to be dealt with. This increase in overheads has to be balanced against the decrease in GC overheads that results from effective interning.
Not a complete answer but additional food for thought (found here):
Therefore, the primary benefit in this case is that using the == operator for internalized strings is a lot faster than using the equals() method [for non-internalized Strings]. So, use the intern() method if you're going to be comparing strings more than a time or three.
Never, ever, use intern on user-supplied data, as that can cause denial of service attacks (as intern()ed strings are never freed). You can do validation on the user-supplied strings, but then again you've done most of the work needed for intern().