Which data structure to use? - Java

I have a scenario like this,
I need to store the count of each string and return the top ten strings with the highest counts,
for example,
String Count
---------------------------------
String1 10
String2 9
String3 8
.
.
.
String10 1
I was thinking of using a hashtable to store each string and its count, but it would be difficult to retrieve the top ten strings from it, since I would have to loop through it again to find them.
Any other suggestions here?

A priority queue.
You can make a class to put in it:
public class StringHolder implements Comparable<StringHolder> {
    private String string;
    private int count;
    // compareTo and equals methods
}
The queue then orders holders by count as you insert them, and it is easy to poll off the top 10.
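A runnable sketch of that idea; the class and method names here are illustrative, not from the original post. A descending compareTo makes the PriorityQueue hand back the highest counts first.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

// Holder ordered by count, highest first, so the queue's head is the top string.
class StringHolder implements Comparable<StringHolder> {
    final String string;
    final int count;

    StringHolder(String string, int count) {
        this.string = string;
        this.count = count;
    }

    @Override
    public int compareTo(StringHolder other) {
        return Integer.compare(other.count, this.count); // descending by count
    }

    // Poll the n most frequent strings off a priority queue of holders.
    static List<String> topN(PriorityQueue<StringHolder> queue, int n) {
        List<String> top = new ArrayList<String>();
        while (!queue.isEmpty() && top.size() < n) {
            top.add(queue.poll().string);
        }
        return top;
    }
}
```

Polling repeatedly yields holders in descending count order, so the first ten polls are the top ten.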

Simply use a sorted map (e.g. a TreeMap with a reverse-order comparator) like
Map<Integer, List<String>> strings
where the key is the frequency and the value is the list of strings that occur with that frequency.
Then loop through the map in descending key order and, with an inner loop through the value lists, collect strings until you've seen 10. Those are the 10 most frequent ones.
With the additional requirement that the algorithm should support updates of frequencies: add the strings to a map like Map<String, Integer> where the key is the string and the value the actual frequency (increment the value if you see a string again). After that copy the key/value pairs to the map I suggested above.
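A minimal sketch of this two-phase approach, assuming the counts are built from a plain list of strings (class and method names are illustrative):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class FrequencyMaps {
    static List<String> topTen(List<String> input) {
        // Phase 1: count each string (increment if seen again).
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (String s : input) {
            Integer c = counts.get(s);
            counts.put(s, c == null ? 1 : c + 1);
        }
        // Phase 2: invert into a map keyed by frequency, descending.
        TreeMap<Integer, List<String>> byFreq =
                new TreeMap<Integer, List<String>>(Collections.reverseOrder());
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            List<String> bucket = byFreq.get(e.getValue());
            if (bucket == null) {
                bucket = new ArrayList<String>();
                byFreq.put(e.getValue(), bucket);
            }
            bucket.add(e.getKey());
        }
        // Collect strings until we have seen ten.
        List<String> top = new ArrayList<String>();
        for (List<String> bucket : byFreq.values()) {
            for (String s : bucket) {
                if (top.size() == 10) {
                    return top;
                }
                top.add(s);
            }
        }
        return top;
    }
}
```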

For any "find the top N items" task, a priority queue is the perfect solution. See Java's PriorityQueue class.

I am not sure, but I guess that the most suitable and elegant class for your needs is Guava's TreeMultiset:
http://docs.guava-libraries.googlecode.com/git/javadoc/com/google/common/collect/TreeMultiset.html

Guava has a HashMultiset which would be very useful for this.
HashMultiset<String> ms = HashMultiset.create();
ms.add(astring);
ms.add(astring, times);
ImmutableMultiset<String> ims = Multisets.copyHighestCountFirst(ms);
// Iterate through the first 10 elements; they will be your top 10,
// from highest to lowest count.

You need a max-heap data structure for this. Put everything into the max-heap and make 10 (or whatever n) successive removals.
If you intend to keep reusing the data once it's loaded into memory, it might be worth the expense of sorting by value instead of using a heap.
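A sketch of the max-heap idea using Java's PriorityQueue with a reversed comparator over map entries, assuming the counts already live in a Map (names are illustrative):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

public class HeapTopN {
    static List<String> topN(Map<String, Integer> counts, int n) {
        // PriorityQueue is a min-heap; a reversed comparator turns it into a max-heap.
        PriorityQueue<Map.Entry<String, Integer>> heap =
                new PriorityQueue<Map.Entry<String, Integer>>(
                        Math.max(1, counts.size()),
                        new Comparator<Map.Entry<String, Integer>>() {
                            public int compare(Map.Entry<String, Integer> a,
                                               Map.Entry<String, Integer> b) {
                                return Integer.compare(b.getValue(), a.getValue());
                            }
                        });
        heap.addAll(counts.entrySet());
        // Successive removals yield entries in descending count order.
        List<String> top = new ArrayList<String>();
        while (!heap.isEmpty() && top.size() < n) {
            top.add(heap.poll().getKey());
        }
        return top;
    }
}
```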

Related

How to obtain counts of each of the elements of the list?

Given a sorted list of something (a, a, b, c, c), what would be the most efficient way to recognize that a exists in the list 2 times, b once, and c 2 times?
Aside from the obvious approach of making a map of counts, can we do better than this?
if (map.containsKey(key)) {
    map.put(key, map.get(key) + 1);
} else {
    map.put(key, 1);
}
Ultimately the goal is to iterate over the list and know at any given point how many times a key has been seen so far. Putting things in a map seems like a step we don't really need.
I would use a Multiset implementation in Guava - probably a HashMultiset. That avoids having to do a put/get on each iteration - if the item already exists when you add it, it just increments the count. It's a bit like using a HashMap<Foo, AtomicInteger>.
See the Guava User's Guide entry on Multiset for more details.
Your method, at each iteration, performs:
one lookup for containsKey
one lookup for get
one unboxing from Integer to int
one boxing from int to Integer
one put
You could simply compare the current element to the previous one, increment a count if it's equal, and put the count if not (and reset the counter to 1).
But even if you keep your algorithm, using get and comparing the result to null would at least avoid an unnecessary lookup.
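A sketch of the compare-to-previous approach on an already-sorted list (method name is illustrative): one pass, no per-element map lookup until a run ends.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class RunLengthCounts {
    // On a sorted list, equal elements are adjacent, so counting is one pass.
    static Map<String, Integer> countRuns(List<String> sorted) {
        Map<String, Integer> counts = new LinkedHashMap<String, Integer>();
        String previous = null;
        int run = 0;
        for (String current : sorted) {
            if (current.equals(previous)) {
                run++; // same as previous: extend the current run
            } else {
                if (previous != null) {
                    counts.put(previous, run); // run ended: record its length
                }
                previous = current;
                run = 1; // reset for the new element
            }
        }
        if (previous != null) {
            counts.put(previous, run); // record the final run
        }
        return counts;
    }
}
```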

Data structure for holding sets of interchangeable strings

I have a set of strings. Out of these, groups of 2 or more may represent the same thing. These groups should be stored in a way that given any member of the group, you can fetch other members of the group with high efficiency.
So given this initial set: ["a","b1","b2","c1","c2","c3"] the result structure should be something like ["a",["b1","b2"],["c1","c2","c3"]] and Fetch("b") should return ["b1","b2"].
Is there a specific data structure and/or algorithm for this purpose?
EDIT: "b1" and "b2" are not actual strings, they're indicating that the 2 belong to the same group. Otherwise a Trie would be a perfect fit.
I may be misinterpreting the initial problem setup, but I believe that there is a simple and elegant solution to this problem using off-the-shelf data structures. The idea is, at a high level, to create a map from strings to sets of strings. Each key in the map will be associated with the set of strings that it's equal to. Assuming that each string in a group is mapped to the same set of strings, this can be done time- and space-efficiently.
The algorithm would probably look like this:
Construct a map M from strings to sets of strings.
Group all strings together that are equal to one another (this step depends on how the strings and groups are specified).
For each cluster:
Create a canonical set of the strings in that cluster.
Add each string to the map as a key whose value is the canonical set.
This algorithm and the resulting data structure are quite efficient. Assuming that you already know the clusters in advance, this process (using a trie as the implementation of the map and a simple list as the data structure for the sets) requires you to visit each character of each input string exactly twice: once when inserting it into the trie and once when adding it to the set of strings equal to it, assuming that you're making a deep copy. It is therefore an O(n) algorithm.
Moreover, lookup is quite fast - to find the set of strings equal to some string, just walk the trie to find the string, look up the associated set of strings, then iterate over it. This takes O(L + k) time, where L is the length of the string and k is the number of matches.
Hope this helps, and let me know if I've misinterpreted the problem statement!
Since this is Java, I would use a HashMap<String, Set<String>>. This would map each string to its equivalence set (which would contain that string and all others that belong to the same group). How you would construct the equivalence sets from the input depends on how you define "equivalent". If the inputs are in order by group (but not actually grouped), and if you had a predicate implemented to test equivalence, you could do something like this:
boolean differentGroups(String a, String b) {
    // equivalence test (must handle a == null)
}
Map<String, Set<String>> makeMap(ArrayList<String> input) {
    Map<String, Set<String>> map = new HashMap<String, Set<String>>();
    String representative = null;
    Set<String> group = null;
    for (String next : input) {
        if (differentGroups(representative, next)) {
            representative = next;
            group = new HashSet<String>();
        }
        group.add(next);
        map.put(next, group);
    }
    return map;
}
Note that this only works if the groups are contiguous in the input. If they aren't, you'll need more complex bookkeeping to build the group structure.
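A runnable sketch of the whole approach; the equivalence test here (same leading character) is purely illustrative and stands in for a real predicate.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class EquivalenceGroups {
    // Illustrative stand-in: treat strings with the same first character as one group.
    static boolean differentGroups(String a, String b) {
        return a == null || a.charAt(0) != b.charAt(0);
    }

    static Map<String, Set<String>> makeMap(List<String> input) {
        Map<String, Set<String>> map = new HashMap<String, Set<String>>();
        String representative = null;
        Set<String> group = null;
        for (String next : input) {
            if (differentGroups(representative, next)) {
                representative = next;
                group = new HashSet<String>();
            }
            group.add(next);
            map.put(next, group); // every member shares the same Set instance
        }
        return map;
    }
}
```

Because every member of a group maps to the same Set instance, fetching any member yields the whole group in one lookup.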

Get List common value count

I have two ArrayList<Long>s of huge size, about 500,000 elements each. I tried a for loop using list.contains(object), but it takes too much time. I also tried splitting one list and comparing in multiple threads, with no effective result.
I need the number of elements that appear in both lists.
Any optimized way?
Let l1 be the first list and l2 the second. With contains, each element of l1 is checked against up to all of l2, so in big-O notation that runs in O(l1*l2).
Another approach is to insert one list into a HashSet, then for each element of the other list test whether it exists in the set. That is roughly 2*l1 + l2 operations, i.e. O(l1 + l2).
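A sketch of that HashSet approach (names are illustrative); note it counts every matching occurrence in l2, so it behaves as described only for duplicate-free lists:

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class CommonCount {
    static int countCommon(List<Long> l1, List<Long> l2) {
        Set<Long> set = new HashSet<Long>(l1); // one pass over l1
        int common = 0;
        for (Long value : l2) {                // one O(1) lookup per element of l2
            if (set.contains(value)) {
                common++;
            }
        }
        return common;
    }
}
```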
Have you considered putting your elements into a HashSet instead? That would make the lookups much faster, though it only works if you don't have duplicates.
If you have duplicates, you can construct a HashMap with the element as the key and its count as the value.
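A sketch of that count-map variant (names are illustrative): with duplicates, a common value is counted min(occurrences in list 1, occurrences in list 2) times.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CommonCountWithDuplicates {
    static int countCommon(List<Long> l1, List<Long> l2) {
        // Count occurrences of each value in the first list.
        Map<Long, Integer> counts = new HashMap<Long, Integer>();
        for (Long v : l1) {
            Integer c = counts.get(v);
            counts.put(v, c == null ? 1 : c + 1);
        }
        // Match each element of the second list against the remaining counts.
        int common = 0;
        for (Long v : l2) {
            Integer c = counts.get(v);
            if (c != null && c > 0) {
                common++;
                counts.put(v, c - 1); // consume one occurrence
            }
        }
        return common;
    }
}
```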
The general mechanism would be to sort both lists and then iterate over the sorted lists looking for matches.
A list isn't an efficient data structure when you have many elements and need to search for one. Use a structure that is more efficient to search, such as a tree or a hash map.
Let us assume that list one has m elements and list two has n, with m > n. If the elements are not numerically ordered, and it seems they are not, the total number of comparison steps (that is, the cost of the method) is on the order of m*n.
Keeping both lists ordered is the optimal solution. If the lists are ordered, the cost of comparing them is on the order of m + n; this optimal result is achieved by iterating both lists with two cursors. The method can be written as follows:
int i = 0, j = 0;
int count = 0;
while (i < list1.size() && j < list2.size()) {
    int cmp = list1.get(i).compareTo(list2.get(j));
    if (cmp == 0) {
        count++;
        i++;
        j++;
    } else if (cmp < 0) {
        i++;
    } else {
        j++;
    }
}
If it is possible for you to keep the lists ordered all the time, this method will make a difference. Note also that splitting and comparing is not really possible unless the lists are ordered.

Find and list duplicates in an unordered array consisting of 10,000,000,00 elements

How can duplicate elements in an array of 10,000,000,00 unordered elements be determined, and how can they be listed?
Please ensure performance is taken care of in the logic of the Java code.
What are the space complexity and time complexity of the logic?
What is the space complexity and time complexity of the logic?
Consider an example array, DuplicateArray[], as shown below.
String DuplicateArray[] = {"tom","wipro","hcl","Ibm","rachael","tom","wipro","hcl","Ibm","rachael",
"Bill","HP","hcl","Ibm","rachael","tom","wipro","hcl","Ibm","rachael","tom","wipro","hcl","Ibm","rachael",
"Bill","HP","hcl","Ibm","rachael","tom","wipro","hcl","Ibm","rachael","tom","wipro","hcl","Ibm","rachael",
"Agnus","wipro","hcl","Ibm","rachael","tom","wipro","hcl","Ibm","rachael","tom","wipro","hcl","Ibm","rachael",
"Obama","wipro","hcl","Ibm","rachael","tom","wipro","hcl","Ibm","rachael","rachael","tom","wipro","hcl","Ibm","rachael",
"Obama","HP","TCS","CTS","rachael","tom","wipro","hcl","Ibm","rachael","rachael","tom","wipro","hcl","Ibm","rachael"};
I suggest you use a Set; a HashSet will be best for you. Put your elements into it one by one, and check for existence on every insert operation.
Something like this:
HashSet<String> hs = new HashSet<String>();
HashSet<String> answer = new HashSet<String>();
for (String s : DuplicateArray) {
    if (!hs.contains(s))
        hs.add(s);
    else
        answer.add(s);
}
The code depends on the assumption that the elements of your array are Strings.
Here you go:
class MyValues implements Comparable<MyValues> {
    public int count = 1;
    private final String value;

    public MyValues(String v) {
        value = v;
    }

    public int compareTo(MyValues other) {
        return value.compareTo(other.value);
    }
}
Now iterate for duplicates:
TreeSet<MyValues> values = new TreeSet<MyValues>();

for (String s : duplicateArray) {
    MyValues v = new MyValues(s);
    if (!values.add(v)) {
        values.floor(v).count++; // already present: bump the stored count
    }
}
Space is linear; time is O(n log n) because of the TreeSet.
How many duplicates are expected? A few or comparable to the number of entries or something in between?
Do you know anything else about the values? E.g are they from some specific dictionary?
If not, iterate over the array building a HashSet, noting when you are about to add an entry that's already there and keeping those in a list. I can't see anything else being faster.
Firstly, do you mean 10,000,000,00 as one billion or ten billion? If you mean the latter, you cannot have more than 2 billion elements in an array or a Set, so the suggestions you have so far will not work. To hold 10 billion Strings in memory you would need at least 640 GB, and AFAIK there is no server available that will allow that much memory in a single JVM.
For a task this large, you may have to consider a solution which breaks up the work, either across multiple machines or put the work into files to be processed later.
You have to either:
Assume you have a relatively small number of unique Strings. In this case, you can build an in-memory Set of the words you have seen so far (or at least assume it will fit into memory).
Or break the input up into manageable sizes. A simple way is to write to a few hundred work files chosen by hashcode. Equal strings have equal hashcodes, so as you process each file in memory, you know it contains all the copies of any duplicate it holds.
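A sketch of that second option; the file naming and the partition count are arbitrary choices for illustration:

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class HashPartitionDuplicates {
    // Spread strings across work files by hash code. Equal strings always
    // land in the same file, so each file can be deduplicated on its own.
    static List<Path> partition(Iterable<String> input, Path dir, int parts) throws IOException {
        List<Path> files = new ArrayList<Path>();
        List<BufferedWriter> writers = new ArrayList<BufferedWriter>();
        for (int i = 0; i < parts; i++) {
            Path p = dir.resolve("part-" + i + ".txt");
            files.add(p);
            writers.add(Files.newBufferedWriter(p));
        }
        for (String s : input) {
            int bucket = Math.floorMod(s.hashCode(), parts);
            writers.get(bucket).write(s);
            writers.get(bucket).newLine();
        }
        for (BufferedWriter w : writers) {
            w.close();
        }
        return files;
    }

    // One file at a time fits in memory; report the strings seen twice.
    static Set<String> duplicatesIn(Path file) throws IOException {
        Set<String> seen = new HashSet<String>();
        Set<String> dups = new HashSet<String>();
        for (String line : Files.readAllLines(file)) {
            if (!seen.add(line)) {
                dups.add(line);
            }
        }
        return dups;
    }
}
```

The union of the per-file duplicate sets is the full duplicate list, without ever holding all strings in memory at once.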

nth item of hashmap

HashMap<Integer, Float> selections = new HashMap<Integer, Float>();
How can I get the Integer key of the 3rd smallest Float value in the whole HashMap?
Edit
I'm using the HashMap for this:
for (InflatedRunner runner : prices.getRunners()) {
    for (InflatedMarketPrices.InflatedPrice price : runner.getLayPrices()) {
        if (price.getDepth() == 1) {
            selections.put(new Integer(runner.getSelectionId()), new Float(price.getPrice()));
        }
    }
}
I need the runner with the 3rd smallest price at depth 1.
Maybe I should implement this in another way?
Michael Mrozek nails it with his question about whether you're using HashMap right: this is a highly atypical scenario for a HashMap. That said, you can do something like this:
get the Set<Map.Entry<K,V>> from the HashMap<K,V>.entrySet().
addAll to List<Map.Entry<K,V>>
Collections.sort the list with a custom Comparator<Map.Entry<K,V>> that sorts based on V.
If you just need the 3rd Map.Entry<K,V> only, then a O(N) selection algorithm may suffice.
//after edit
It looks like selections should really be a SortedMap<Float, InflatedRunner>. You should look at java.util.TreeMap.
Here's an example of how TreeMap can be used to get the 3rd lowest key:
TreeMap<Integer,String> map = new TreeMap<Integer,String>();
map.put(33, "Three");
map.put(44, "Four");
map.put(11, "One");
map.put(22, "Two");
int thirdKey = map.higherKey(map.higherKey(map.firstKey()));
System.out.println(thirdKey); // prints "33"
Also note how I take advantage of Java's auto-boxing/unboxing feature between int and Integer. I noticed that you used new Integer and new Float in your original code; this is unnecessary.
//another edit
It should be noted that if you have multiple InflatedRunner with the same price, only one will be kept. If this is a problem, and you want to keep all runners, then you can do one of a few things:
If you really need a multi-map (one key can map to multiple values), then you can:
have TreeMap<Float,Set<InflatedRunner>>
Use MultiMap from Google Collections
If you don't need the map functionality, then just have a List<RunnerPricePair> (sorry, I'm not familiar with the domain to name it appropriately), where RunnerPricePair implements Comparable<RunnerPricePair> that compares on prices. You can just add all the pairs to the list, then either:
Collections.sort the list and get the 3rd pair
Use O(N) selection algorithm
Are you sure you're using hashmaps right? They're used to quickly look up a value given a key; it's highly unusual to sort the values and then try to find a corresponding key. If anything, you should be mapping the float to the int, so you could at least sort the float keys and get the integer value of the third smallest that way.
You have to do it in steps:
Get the Collection<V> of values from the Map
Sort the values
Choose the index of the nth smallest
Think about how you want to handle ties.
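Those steps might look like this, assuming the map from the question and a 1-based n (the helper name is illustrative); here keys accompany the values via entrySet so no reverse lookup is needed:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

public class NthSmallestValue {
    static Integer keyOfNthSmallest(Map<Integer, Float> map, int n) {
        // Copy the entries into a list so they can be sorted by value.
        List<Map.Entry<Integer, Float>> entries =
                new ArrayList<Map.Entry<Integer, Float>>(map.entrySet());
        Collections.sort(entries, new Comparator<Map.Entry<Integer, Float>>() {
            public int compare(Map.Entry<Integer, Float> a, Map.Entry<Integer, Float> b) {
                return a.getValue().compareTo(b.getValue());
            }
        });
        return entries.get(n - 1).getKey(); // n is 1-based
    }
}
```

Note that ties between equal values land in arbitrary order; break them in the comparator if that matters.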
You could do it with the google collections BiMap, assuming that the Floats are unique.
If you regularly need to get the key of the nth item, consider:
using a TreeMap, which efficiently keeps keys in sorted order
then using a double map (i.e. one TreeMap mapping Integer to Float, the other mapping Float to Integer)
You have to weigh up the inelegance and potential risk of bugs from needing to maintain two maps with the scalability benefit of having a structure that efficiently keeps the keys in order.
You may need to think about two keys mapping to the same float...
P.S. Forgot to mention: if this is an occasional operation and you just need the nth largest of a large number of items, you could consider a selection algorithm (effectively a sort where you don't bother sorting sublists whose order makes no difference to the position of the item you're looking for).
