Java: optimize HashSet for large-scale duplicate detection

I am working on a project where I am processing a lot of tweets; the goal is to remove duplicates as I process them. I have the tweet IDs, which come in as strings of the format "166471306949304320".
I have been using a HashSet<String> for this, which works fine for a while. But by the time I get to around 10 million items I am drastically bogged down and eventually get a GC error, presumably from the rehashing. I tried defining a better size/load with
tweetids = new HashSet<String>(220000,0.80F);
and that lets it get a little farther, but it is still excruciatingly slow (by around 10 million items it is taking 3x as long to process). How can I optimize this? Given that I have an approximate idea of how many items should be in the set by the end (in this case, around 20-22 million), should I create a HashSet that rehashes only two or three times, or would the overhead of such a set incur too many time penalties? Would things work better if I weren't using a String, or if I defined a different hashCode function (which, in the case of a particular instance of a String, I'm not sure how to do)? This portion of the implementation code is below.
tweetids = new HashSet<String>(220000,0.80F); // in constructor
duplicates = 0;
...
// In loop: for each tweet
String twid = (String) tweet_twitter_data.get("id");
// Check that we have not processed this tweet already
if (!(tweetids.add(twid))) {
    duplicates++;
    continue;
}
SOLUTION
Thanks to your recommendations, I solved it. The problem was the amount of memory required for the hash representations; first, the HashSet<String> was simply enormous and uncalled for, because hashing and storing full Strings is exorbitant at this scale. Next I tried a Trie, but it crashed at just over 1 million entries; reallocating the arrays was problematic. I used a HashSet<Long> to better effect and almost made it, but speed decayed and it finally crashed on the last leg of the processing (around 19 million). The solution came with departing from the standard library and using Trove. It finished 22 million records a few minutes faster than not checking duplicates at all. The final implementation was simple, and looked like this:
import gnu.trove.set.hash.TLongHashSet;
...
TLongHashSet tweetids; // class variable
...
tweetids = new TLongHashSet(23000000,0.80F); // in constructor
...
// inside for(each record)
String twid = (String) tweet_twitter_data.get("id");
if (!(tweetids.add(Long.parseLong(twid)))) {
    duplicates++;
    continue;
}

You may want to look beyond the Java collections framework. I've done some memory-intensive processing and you will face several problems:
The number of buckets for large hash maps and hash sets is going to cause a lot of overhead (memory). You can influence this by using some kind of custom hash function and a modulo of e.g. 50000.
Strings are represented using 16-bit characters in Java. You can halve that by using UTF-8 encoded byte arrays for most scripts.
HashMaps are in general quite wasteful data structures, and HashSets are basically just a thin wrapper around them.
Given that, take a look at Trove or Guava for alternatives. Also, your IDs look like longs. Those are 64 bits, quite a bit smaller than the string representation.
An alternative you might want to consider is using Bloom filters (Guava has a decent implementation). A Bloom filter would tell you if something is definitely not in a set, and with reasonable certainty (less than 100%) whether something is contained. That, combined with some disk-based solution (e.g. a database, MapDB, memcached, ...), should work reasonably well. You could buffer up incoming new IDs, write them in batches, and use the Bloom filter to check whether you need to look in the database, thus avoiding expensive lookups most of the time.
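As a rough sketch of that Bloom-filter prefilter idea using Guava (the sizing numbers and the backing-store step are placeholders, not part of the original answer):

import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

// Sized for roughly 22 million IDs at a 1% false-positive rate (placeholder numbers).
BloomFilter<Long> maybeSeen = BloomFilter.create(Funnels.longFunnel(), 22000000, 0.01);

// Inside the per-tweet loop:
long id = Long.parseLong(twid);
if (!maybeSeen.mightContain(id)) {
    // Definitely new: skip the expensive disk/database lookup entirely.
    maybeSeen.put(id);
    // ...queue the id for a batched write to the backing store (hypothetical step)
} else {
    // Possibly seen before: only now pay for a lookup in the backing store,
    // and count it as a duplicate if it is really there.
}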

If you are just looking for the existence of Strings, then I would suggest you try using a Trie (also called a prefix tree). The total space used by a Trie should be less than a HashSet, and it's quicker for string lookups.
The main disadvantage is that it can be slower when used from a hard disk, as it's loading a tree, not a linearly stored structure like a hash table. So make sure that it can be held inside of RAM.
The link I gave is a good list of pros/cons of this approach.
As an aside, the Bloom filters suggested by Jilles van Gurp are great fast prefilters.
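To make that concrete, here is a minimal sketch of a trie specialised for digit-only keys such as tweet IDs (one child slot per decimal digit); it illustrates the idea rather than a tuned implementation:

class DigitTrie {
    private static final class Node {
        Node[] children = new Node[10]; // one slot per decimal digit
        boolean terminal;               // true if a complete ID ends at this node
    }

    private final Node root = new Node();

    /** Returns true if the ID was new, false if it was already present. */
    boolean add(String digits) {
        Node node = root;
        for (int i = 0; i < digits.length(); i++) {
            int d = digits.charAt(i) - '0';
            if (node.children[d] == null) {
                node.children[d] = new Node();
            }
            node = node.children[d];
        }
        boolean wasNew = !node.terminal;
        node.terminal = true;
        return wasNew;
    }
}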

Simple, untried and possibly stupid suggestion: Create a Map of Sets, indexed by the first/last N characters of the tweet ID:
Map<String, Set<String>> sets = new HashMap<String, Set<String>>();
String tweetId = "166471306949304320";
String prefix = tweetId.substring(0, 5);   // String has substring(), not substr()
Set<String> bucket = sets.get(prefix);
if (bucket == null) {
    bucket = new HashSet<String>();
    sets.put(prefix, bucket);              // only create the per-prefix set once
}
bucket.add(tweetId);
assert sets.containsKey(prefix) && sets.get(prefix).contains(tweetId);
That easily lets you keep the maximum size of the hashing space(s) below a reasonable value.

Related

Implementing efficient data structure using Arrays only

As part of my programming course I was given an exercise to implement my own String collection. I was planning on using an ArrayList or a similar collection, but one of the constraints is that we are not allowed to use any Java API to implement it, so only arrays are allowed. I could have implemented this using arrays; however, efficiency is very important, as is the amount of data that this code will be tested with. It was suggested that I use hash tables or ordered trees as they are more efficient than arrays. After doing some research I decided to go with hash tables because they seemed easy to understand and implement, but once I started writing code I realised it is not as straightforward as I thought.
So here are the problems I have come up with and would like some advice on what is the best approach to solve them again with efficiency in mind:
ACTUAL SIZE: If I understood it correctly, hash tables are not ordered (indexed), which means there are going to be gaps between items because the hash function gives different indices. So how do I know when the array is full and I need to resize it?
RESIZE: One of the difficulties is that I need to create a dynamic data structure using arrays. So if I have an array String[100], once it gets full I will need to resize it by some factor (I decided to increase it by 100 each time). Once I do that, I will need to change the positions of all existing values, since their hash keys will be different, as the key is calculated with:
int position = "orange".hashCode() % currentArraySize;
So if I try to find a certain value, its hash key will be different from what it was when the array was smaller.
HASH FUNCTION: I was also wondering if built-in hashCode() method in String class is efficient and suitable for what I am trying to implement or is it better to create my own one.
DEALING WITH MULTIPLE OCCURRENCES: one of the requirements is to be able to add multiple words that are the same, because I need to be able to count how many times a word is stored in my collection. Since they are going to have the same hash code, I was planning to add the next occurrence at the next index, hoping that there will be a gap. I don't know if it is the best solution, but here is how I implemented it:
public int count(String word) {
    int count = 0;
    while (collection[(word.hashCode() % size) + count] != null
            && collection[(word.hashCode() % size) + count].equals(word)) {
        count++;
    }
    return count;
}
Thank you in advance for your advice. Please ask if anything needs to be clarified.
P.S. The length of words is not fixed and varies greatly.
UPDATE Thank you for your advice; I know I made a few silly mistakes there and I will try to do better. So I took all your suggestions and quickly came up with the following structure. It is not elegant, but I hope it is roughly what you meant. I did have to make a few judgement calls, such as the bucket size; for now I halve the number of elements, but is there a way to calculate this, or some general value? Another uncertainty was by what factor to increase my array: should I multiply by some number n, or is adding a fixed number also acceptable? Also, I was wondering about general efficiency, because I am actually creating instances of classes, but String is a class too, so I am guessing the difference in performance should not be too big?
ACTUAL SIZE: The built-in Java HashMap just resizes when the total number of elements exceeds the number of buckets multiplied by a number called the load factor, which is by default 0.75. It does not take into account how many buckets are actually full. You don't have to, either.
RESIZE: Yes, you'll have to rehash everything when the table is resized, which does include recomputing its hash.
So if I try to find a certain value, its hash key will be different from what it was when the array was smaller.
Yup.
HASH FUNCTION: Yes, you should use the built in hashCode() function. It's good enough for basic purposes.
DEALING WITH MULTIPLE OCCURRENCES: This is complicated. One simple solution would just be to have the hash entry for a given string also keep count of how many occurrences of that string are present. That is, instead of keeping multiple copies of the same string in your hash table, keep an int along with each String counting its occurrences.
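A minimal sketch of that idea, pairing each stored word with a counter. This sketch uses separate chaining for collisions, which is different from the linear probing in the question, and all names are illustrative:

class CountingStringTable {
    private static final class Entry {
        final String word;
        int count = 1;
        Entry next;                      // chain for words that hash to the same bucket
        Entry(String word, Entry next) { this.word = word; this.next = next; }
    }

    private Entry[] buckets = new Entry[64];
    private int size;

    void add(String word) {
        int b = index(word);
        for (Entry e = buckets[b]; e != null; e = e.next) {
            if (e.word.equals(word)) { e.count++; return; }   // duplicate: bump the counter
        }
        buckets[b] = new Entry(word, buckets[b]);
        if (++size > buckets.length * 3 / 4) resize();        // load factor 0.75
    }

    int count(String word) {
        for (Entry e = buckets[index(word)]; e != null; e = e.next) {
            if (e.word.equals(word)) return e.count;
        }
        return 0;
    }

    private int index(String word) {
        return (word.hashCode() & 0x7fffffff) % buckets.length; // mask keeps the index non-negative
    }

    private void resize() {
        Entry[] old = buckets;
        buckets = new Entry[old.length * 2];
        for (Entry head : old) {
            for (Entry e = head; e != null; ) {
                Entry next = e.next;
                int b = index(e.word);
                e.next = buckets[b];
                buckets[b] = e;
                e = next;
            }
        }
    }
}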
So how do I know when array is full and I need to resize it?
You keep track of the size yourself, just as HashMap does. When the size used > capacity * load factor, you grow the underlying array, either as a whole or in part.
int position = "orange".hashCode() % currentArraySize;
Some things to consider:
The % of a negative value is a negative value.
Math.abs can still return a negative value (Math.abs(Integer.MIN_VALUE) is negative).
Using & with a bit mask is faster; however, you need a size which is a power of 2.
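A short illustration of those three points, reusing currentArraySize from the question's snippet (a sketch, not part of the original answer):

int hash = "orange".hashCode();

// % of a negative hash gives a negative index:
int naive = hash % currentArraySize;                 // may be < 0

// Math.abs does not fully save you either:
int risky = Math.abs(hash) % currentArraySize;       // Math.abs(Integer.MIN_VALUE) is still negative

// Safe with %: clear the sign bit first.
int safe = (hash & 0x7fffffff) % currentArraySize;

// Fastest: a bit mask, but only if the table size is a power of 2.
int masked = hash & (currentArraySize - 1);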
I was also wondering if built-in hashCode() method in String class is efficient and suitable for what I am trying to implement or is it better to create my own one.
The built-in hashCode is cached, so it is fast. However, it is not a great hashCode: it has poor randomness in the lower bits, and in the higher bits for short strings. You might want to implement your own hashing strategy, possibly a 64-bit one.
DEALING WITH MULTIPLE OCCURRENCES:
This is usually done with a counter for each key. That way you can have, say, 32,767 duplicates of the same key/element (if you use a short) or 2 billion (if you use an int).

Efficient Intersection and Union of Lists of Strings

I need to efficiently find the ratio of (intersection size / union size) for pairs of Lists of strings. The lists are small (mostly about 3 to 10 items), but I have a huge number of them (~300K) and have to do this on every pair, so I need this actual computation to be as efficient as possible. The strings themselves are short unicode strings -- averaging around 5-10 unicode characters.
The accepted answer here Efficiently compute Intersection of two Sets in Java? looked extremely helpful but (likely because my sets are small (?)) I haven't gotten much improvement by using the approach suggested in the accepted answer.
Here's what I have so far:
protected double uuEdgeWeight(UVertex u1, UVertex u2) {
    Set<String> u1Tokens = new HashSet<String>(u1.getTokenlist());
    List<String> u2Tokens = u2.getTokenlist();
    int intersection = 0;
    int union = u1Tokens.size();
    for (String s : u2Tokens) {
        if (u1Tokens.contains(s)) {
            intersection++;
        } else {
            union++;
        }
    }
    return ((double) intersection / union);
}
My question is, is there anything I can do to improve this, given that I'm working with Strings, which may be more time-consuming to check for equality than other data types?
I think that because I'm comparing multiple u2's against the same u1, I could get some improvement by doing the cloning of u2 into a HashSet outside of the loop (which isn't shown), meaning I'd pass in the HashSet instead of the object from which I could pull the list and then clone into a set.
Anything else I can do to squeak out even a small improvement here?
Thanks in advance!
Update
I've updated the numeric specifics of my problem above. Also, due to the nature of the data, most (90%?) of the intersections are going to be empty. My initial attempt at this used the clone-the-set-then-retainAll approach to find the intersection, and then shortcut out before doing the clone and addAll to find the union. That was about as efficient as the code posted above, presumably because of the trade-off between it being a slower algorithm overall versus being able to shortcut out a lot of the time. So, I'm thinking about ways to take advantage of the infrequency of overlapping sets, and would appreciate any suggestions in that regard.
Thanks in advance!
You would get a large improvement by moving the HashSet outside of the loop.
If the HashSet really has only got a few entries in it, then it is probably just as fast to use an array, since traversing an array is much simpler/faster. I'm not sure where the threshold would lie, but I'd measure both - and be sure that you do the measurements correctly (i.e. warm-up loops before timed loops, etc.).
One thing to try might be using a sorted array for the things to compare against. Scan until you go past current and you can immediately abort the search. That will improve processor branch prediction and reduce the number of comparisons a bit.
If you want to optimize this function (not sure if it actually works in your context), you could assign each unique String an int value; when the String is added to a UVertex, set that int as a bit in a BitSet.
This function then becomes a set.and(otherSet) for the intersection and a set.or(otherSet) for the union. Depending on the number of unique Strings, that could be efficient.
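A rough sketch of that approach (the token-to-bit-index dictionary and the method names are illustrative, not from the original answer; note that BitSet.and and BitSet.or modify the receiver, so one operand is cloned first):

import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

// Global dictionary assigning each unique token a bit index.
Map<String, Integer> tokenIds = new HashMap<String, Integer>();

BitSet toBits(Iterable<String> tokens) {
    BitSet bits = new BitSet();
    for (String t : tokens) {
        Integer id = tokenIds.get(t);
        if (id == null) {
            id = tokenIds.size();
            tokenIds.put(t, id);
        }
        bits.set(id);
    }
    return bits;
}

double jaccard(BitSet a, BitSet b) {
    BitSet intersection = (BitSet) a.clone();
    intersection.and(b);                     // a AND b
    if (intersection.isEmpty()) {
        return 0.0;                          // cheap early exit for the ~90% empty case
    }
    BitSet union = (BitSet) a.clone();
    union.or(b);                             // a OR b
    return (double) intersection.cardinality() / union.cardinality();
}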

Looking for memory efficient design

I'm running some experiments over a large dataset and would like to optimize a particular part. Currently, I have 5-6 Models, each of which stores a mapping from Topics to Lists of Strings. The set of Topics is large and the same between each Model, so there must be a better way. Ultimately the query I need to perform is: what is the String in position x of the List for some Model-Topic combination?
One of the problems with using the mapping method is that if there are say 500k-5M topics, each has a list of 20 strings. Then my Map<Model, Map<Topic, List<String>>> is going to be massive.
Have you tried SortedSet / Maps? Sounds like you need to optimize your search; sorted collections (like TreeMap) should give O(log n) lookups, while searching a regular list is O(n). Of course, this kind of thing is something at which databases excel...
Not clear where/how you want to achieve "memory efficiency". First one needs to look at the particulars of your detailed data to see how much storage that consumes, then examine various ways of organizing it and analyze their efficiency in terms of % overhead vs your "real" data.
A brief glance shows that a HashMap, when you consider the associated tables, has about 80 bytes of overhead per entry. An ArrayList looks to average out around 10-12. Without looking, I would guess that a TreeMap would be more than a HashMap -- maybe 100.
Generally speaking, links within your own objects will be "cheaper", both in storage and speed to access, than links using these aggregating objects. But the aggregating objects are convenient to use, and have been "optimized" to a degree.
(But looking at your update, you probably should be looking at a DB application, rather than holding everything in heap.)
You could use Topic and Model to construct a composite key in a single Map, e.g.
map.put(topic1_id + model1_id, list1_1);
map.put(topic1_id + model2_id, list1_2);
...
map.get(topic_id + model_id)
where the IDs are Strings (or a similar scheme could be used with numeric identifiers).
A similar approach is to assign each topic and model a unique number, then store the lists of strings in arrays, so looking up the list for a given combination is a matter of looking up two indexes, then accessing a given location in a 2D array. (however, this is easier when you know the number of topics and models in advance of constructing the data structure)
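A sketch of that second scheme, assuming the counts are known up front and using illustrative names (topicIndex/modelIndex map each Topic and Model to a dense index, exactly as described above):

import java.util.HashMap;
import java.util.Map;

int numTopics = 500000;   // known in advance (figures from the question)
int numModels = 6;

Map<Topic, Integer> topicIndex = new HashMap<Topic, Integer>();
Map<Model, Integer> modelIndex = new HashMap<Model, Integer>();

// [topic][model][position in the list]
String[][][] lists = new String[numTopics][numModels][];

String query(Model model, Topic topic, int x) {
    Integer t = topicIndex.get(topic);
    Integer m = modelIndex.get(model);
    if (t == null || m == null) {
        return null;
    }
    return lists[t][m][x];
}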
For memory efficiency, also consider the small details. In general, you want to minimise the number of Objects - each Object carries an overhead. ArrayLists can have a lot of wasted space as they grow dynamically, doubling in size when they exceed their current capacity. If you can pre-size them to the required capacity (or use an array instead) then you can save a lot of memory. The same applies when using large numbers of small HashMaps.
One possible data structure is a hierarchy of maps, leading to an array of Strings. E.g.:
HashMap<Model, HashMap<Topic, String[]>> map;
A query function would then look like:
public String query(Model model, Topic topic, int x) {
    HashMap<Topic, String[]> childMap = map.get(model);
    if (childMap == null) {
        return null;
    }
    String[] list = childMap.get(topic);
    if (list == null) {
        return null;
    }
    return list[x];
}
Presuming your Model and Topic structures implement hashCode() and equals() reasonably, the query performance should be quite good.
One potential weakness: I'm assuming you need to index a large number of Model/Topic combinations, and related lists of Strings (if not, you presumably wouldn't be asking about optimization). My guess is that the child String[] arrays will consume a large amount of memory. Each array is a Java object (about 20 bytes) + a pointer at each array location.
2 suggestions there:
1) If many Model/Topic combinations share the same set of Strings, you could gain quite a lot by sharing those String[] instances.
2) If you're using a 64-bit VM, be sure to use compressed ordinary object pointers (-XX:+UseCompressedOops). That will at least keep most of the pointers to 4 bytes instead of 8. Compressed OOPs is the default since 1.6.0_23, so a relatively recent VM will save you some memory here.
One other possibility not mentioned is to store the strings in a String[][][], and the models and topics in Lists such as ArrayList, and then at query time:
public String query(Model model, Topic topic, int x) {
    return strings[models.indexOf(model)][topics.indexOf(topic)][x];
}
It could be further improved for speed if the topics and models were sorted, then binary search rather than indexOf could be used.
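For instance, assuming the models and topics lists are kept sorted and their element types are comparable (an assumption, not stated in the question), indexOf can be swapped for java.util.Collections.binarySearch:

public String query(Model model, Topic topic, int x) {
    int m = Collections.binarySearch(models, model);   // requires models to be sorted
    int t = Collections.binarySearch(topics, topic);   // requires topics to be sorted
    if (m < 0 || t < 0) {
        return null;                                   // negative result means "not present"
    }
    return strings[m][t][x];
}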

How to delete duplicate/aggregate rows faster in a file using Java (no DB)

I have a 2GB big text file, it has 5 columns delimited by tab.
A row will be called duplicate only if 4 out of 5 columns matches.
Right now, I am de-duping by first loading each column into a separate List, then iterating through the lists, deleting duplicate rows as they are encountered, and aggregating.
The problem: it is taking more than 20 hours to process one file.
I have 25 such files to process.
Can anyone please share their experience of how they would go about doing such de-duping?
This de-duping will be throwaway code, so I was looking for a quick/dirty solution to get the job done as soon as possible.
Here is my pseudocode (roughly):
Iterate over the rows
    i = current_row_no.
    Iterate over rows i+1 to last_row
        if (col1 matches      // find duplicate
            && col2 matches
            && col3 matches
            && col4 matches)
        {
            col5List.set(i, get col5);   // aggregate
        }
Duplicate example:
A and B will be duplicates, where A=(1,1,1,1,1), B=(1,1,1,1,2), C=(2,1,1,1,1), and the output would be A=(1,1,1,1,1+2), C=(2,1,1,1,1) [notice that B has been kicked out].
A HashMap will be your best bet. In a single, constant time operation, you can both check for duplication and fetch the appropriate aggregation structure (a Set in my code). This means that you can traverse the entire file in O(n). Here's some example code:
public void aggregate() throws Exception
{
    BufferedReader bigFile = new BufferedReader(new FileReader("path/to/file.csv"));
    // Notice the parameter for initial capacity. Use something that is large enough to prevent rehashings.
    Map<String, HashSet<String>> map = new HashMap<String, HashSet<String>>(500000);
    while (bigFile.ready())
    {
        String line = bigFile.readLine();
        int lastTab = line.lastIndexOf('\t');
        String firstFourColumns = line.substring(0, lastTab);
        // See if the map already contains an entry for the first 4 columns
        HashSet<String> set = map.get(firstFourColumns);
        // If set is null, then the map hasn't seen these columns before
        if (set == null)
        {
            // Make a new Set (for aggregation), and add it to the map
            set = new HashSet<String>();
            map.put(firstFourColumns, set);
        }
        // At this point we either found the set or created it ourselves
        String lastColumn = line.substring(lastTab + 1);
        set.add(lastColumn);
    }
    bigFile.close();
    // A demo that shows how to iterate over the map and set structures
    for (Map.Entry<String, HashSet<String>> entry : map.entrySet())
    {
        String firstFourColumns = entry.getKey();
        System.out.print(firstFourColumns + "=");
        HashSet<String> aggregatedLastColumns = entry.getValue();
        for (String column : aggregatedLastColumns)
        {
            System.out.print(column + ",");
        }
        System.out.println("");
    }
}
A few points:
The initialCapacity parameter for the HashMap is important. If the number of entries gets bigger than the capacity, then the structure is re-hashed, which is very slow. The default initial capacity is 16, which will cause many rehashes for you. Pick a value that you know is greater than the number of unique sets of the first four columns.
If ordered output in the aggregation is important, you can switch the HashSet for a TreeSet.
This implementation will use a lot of memory. If your text file is 2GB, then you'll probably need a lot of RAM in the jvm. You can add the jvm arg -Xmx4096m to increase the maximum heap size to 4GB. If you don't have at least 4GB this probably won't work for you.
This is also a parallelizable problem, so if you're desperate you could thread it. That would be a lot of effort for throw-away code, though. [Edit: This point is likely not true, as pointed out in the comments]
I would sort the whole list on the first four columns, and then traverse through the list knowing that all the duplicates are together. This would give you O(NlogN) for the sort and O(N) for the traverse, rather than O(N^2) for your nested loops.
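A sketch of that sort-then-scan approach, assuming each row has already been split into its five tab-separated columns; the "+" used to aggregate column 5 just follows the example above:

import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Sort rows by the first four columns, then merge neighbouring duplicates
// by aggregating the fifth column.
List<String[]> dedupe(List<String[]> rows) {
    Collections.sort(rows, new Comparator<String[]>() {
        public int compare(String[] a, String[] b) {
            for (int i = 0; i < 4; i++) {
                int c = a[i].compareTo(b[i]);
                if (c != 0) {
                    return c;
                }
            }
            return 0;
        }
    });
    List<String[]> out = new ArrayList<String[]>();
    for (String[] row : rows) {
        String[] last = out.isEmpty() ? null : out.get(out.size() - 1);
        if (last != null && sameFirstFour(last, row)) {
            last[4] = last[4] + "+" + row[4];   // aggregate column 5
        } else {
            out.add(row);
        }
    }
    return out;
}

boolean sameFirstFour(String[] a, String[] b) {
    for (int i = 0; i < 4; i++) {
        if (!a[i].equals(b[i])) {
            return false;
        }
    }
    return true;
}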
I would use a HashSet of the records. This can lead to an O(n) timing instead of O(n^2). You can create a class which has each of the fields with one instance per row.
You need to have a decent amount of memory, but 16 to 32 GB is pretty cheap these days.
I would do something similar to Eric's solution, but instead of storing the actual strings in the HashMap, I'd just store line numbers. So for a particular four-column hash, you'd store a list of line numbers which hash to that value. And then on a second pass through the data, you can remove the duplicates at those line numbers / add the +x as needed.
This way, your memory requirements will be a LOT smaller.
The solutions already posted are nice if you have enough (free) RAM. As Java tends to "still work" even if it is heavily swapping, make sure you don't have too much swap activity if you presume RAM could have been the limiting factor.
An easy "throwaway" solution in case you really have too little RAM is partitioning the file into multiple files first, depending on data in the first four columns (for example, if the third column values are more or less uniformly distributed, partition by the last two digits of that column). Just go over the file once, and write the records as you read them into 100 different files, depending on the partition value. This will need minimal amount of RAM, and then you can process the remaining files (that are only about 20MB each, if the partitioning values were well distributed) with a lot less required memory, and concatenate the results again.
Just to be clear: If you have enough RAM (don't forget that the OS wants to have some for disk cache and background activity too), this solution will be slower (maybe even by a factor of 2, since twice the amount of data needs to be read and written), but in case you are swapping to death, it might be a lot faster :-)
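A rough sketch of that first partitioning pass (the partition count and file names are placeholders; keying on a hash of the first four columns guarantees that all duplicates of a row land in the same partition file):

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;

void partition(String inputPath) throws IOException {
    int parts = 100;                                          // placeholder partition count
    BufferedWriter[] out = new BufferedWriter[parts];
    for (int i = 0; i < parts; i++) {
        out[i] = new BufferedWriter(new FileWriter("part-" + i + ".txt"));
    }
    BufferedReader in = new BufferedReader(new FileReader(inputPath));
    String line;
    while ((line = in.readLine()) != null) {
        String firstFour = line.substring(0, line.lastIndexOf('\t'));
        int p = (firstFour.hashCode() & 0x7fffffff) % parts;  // duplicates share a partition
        out[p].write(line);
        out[p].newLine();
    }
    in.close();
    for (BufferedWriter w : out) {
        w.close();
    }
}

Each partition file can then be de-duplicated independently with the in-memory HashMap approach above.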

Find a large collection of strings within a larger collection of strings

I have a collection of strings that I want to filter. They'll be in this pattern:
xxx_xxx_xxx_xxx
so always a sequence of letters or numbers separated by three underscores. The max length of each string will be 60 characters. I might have a few million of these in my collection.
What data structure could I use to efficiently do something like this:
Get all strings starts with: "abc_123_456"
Get all strings starts with: "def_999_888"
etc..
for example, I could do this:
List<String> matched = new ArrayList<String>();
for (String it : strings) {
    if (it.startsWith(match)) {
        matched.add(it);
    }
}
but that would take a long time if my collection is on the order of millions of strings, and worse yet if the number of matched strings is also high.
The high-level problem is that I want to answer the following question for an app I'm writing: "which of my friends have recommended product A for product B?". I could store this information in a sql table and run the following statement:
select recommender from recs where username='me' and prodIdA='a' and prodIdB='b';
I'm curious if something custom in java/C/C++ could run faster, using encoded flat strings like I have above:
myusername_prodIdA_prodIdB_recommenderusername
The idea being that you could do a starts-with operation on the whole collection of encoded strings to get your answer.
I know trying to implement a custom solution like this is most likely not usable in a production environment, so some SQL DB would be better; I'm just curious though.
Thanks
To do that in Java, you can use a Trie structure.
That being said, I don't think it's a good idea. Dumping "a few million" records into memory won't always work.
That's what databases are for; with the right design and proper indexing you can have very good performance with the DB alone.
I think you are looking for a SortedMap.
"headMap(K toKey)
Returns a view of the portion of this map whose keys are strictly less than toKey."
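For example, with the encoded strings kept in a TreeSet, every entry sharing a prefix lies in one contiguous range, so a prefix query becomes a cheap range view (a sketch; the sample keys are made up, and '\uffff' works as an upper bound because the real keys contain only letters, digits and underscores):

import java.util.NavigableSet;
import java.util.SortedSet;
import java.util.TreeSet;

NavigableSet<String> keys = new TreeSet<String>();
keys.add("abc_123_456_frank");
keys.add("abc_123_456_sue");
keys.add("def_999_888_bob");

// Every key starting with the prefix lies in [prefix, prefix + '\uffff').
String prefix = "abc_123_456";
SortedSet<String> matches = keys.subSet(prefix, prefix + '\uffff');
for (String key : matches) {
    System.out.println(key);   // abc_123_456_frank, abc_123_456_sue
}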
I know trying to implement a custom solution like this is most likely not usable in a production environment, so some sql db would be better, just curious though
If only for the sake of curiosity, you can put all existing distinct "myusername_prodIdA_prodIdB" combinations in a hashtable, and for each combination store a list of relevant results.
So, the structure would look like Map<String, List<String>> and be used like hash.get("def_999_888"). Lookup is constant time (O(1)).
You can get rid of inner list and optimize it in many ways, but this is the idea.
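A minimal sketch of that idea (method and field names are illustrative):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

Map<String, List<String>> recsByKey = new HashMap<String, List<String>>();

// Index pass: one entry per "myusername_prodIdA_prodIdB" combination.
void index(String user, String prodA, String prodB, String recommender) {
    String key = user + "_" + prodA + "_" + prodB;
    List<String> recommenders = recsByKey.get(key);
    if (recommenders == null) {
        recommenders = new ArrayList<String>();
        recsByKey.put(key, recommenders);
    }
    recommenders.add(recommender);
}

// Query: constant-time lookup of who recommended prodA for prodB to this user.
List<String> recommendersFor(String user, String prodA, String prodB) {
    return recsByKey.get(user + "_" + prodA + "_" + prodB);
}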
The first thing that comes to mind for me is pre-processing the strings into some sort of data structure so that they could be searched for efficiently. If you're going to be calling the search function many times, I think it'd be good for you to put all of the strings into a hash table for a constant-time look up. It'd take more processing power to construct your array of strings, but it'd trivialize the task of searching for them.
