Huge Hashtable sorting - number of values - 553685 - java

I created a hashmap to store the occurrences of words across many files (around 10,000 text files). Then I wanted to sort the entries and print the top 10 words. The map is defined as,
Hashtable <String, Integer> problem1Counter = new Hashtable<String, Integer> ();
When I kept the number of files to around 1000, I was able to get the top ten words using a simple sort like this,
String[] keysProblem1 = (String[]) problem1Counter.keySet().toArray(new String[0]);
Integer [] valuesProblem1 = (Integer[])problem1Counter.values().toArray(new Integer[problem1Counter.size()]);
int kk = 0;
String ii = null;
for (int jj = 0; jj < valuesProblem1.length ; jj++){
for (int bb = 0; bb < valuesProblem1.length; bb++){
if(valuesProblem1[jj] < valuesProblem1[bb]){
kk = valuesProblem1[jj];
ii = keysProblem1[jj];
valuesProblem1[jj] = valuesProblem1[bb];
keysProblem1[jj] = keysProblem1[bb];
valuesProblem1 [bb] = kk;
keysProblem1 [bb] = ii;}}}
The above method stops working when the hashtable has more than 553685 values. Can anyone suggest and show a better way to sort them? I'm a newbie to Java but have worked in ActionScript, so I was fairly comfortable.
Thanks.

Your problem starts when you split up keys and values and try to keep the things at each index connected yourself. Instead, keep them coupled, and sort the Map.Entry objects java gives you.
I'm not sure this compiles, but it should give you a start.
// HashMap and Hashtable are very similar, but I generally use HashMap.
HashMap<String, Integer> answers = ...
// Get the Key/Value pairs into a list so we can sort them.
List<Map.Entry<String, Integer>> listOfAnswers =
new ArrayList<Map.Entry<String, Integer>>(answers.entrySet());
// Our comparator defines how to sort our Key/Value pairs. We sort by the
// highest value, and don't worry about the key.
java.util.Collections.sort(listOfAnswers,
new Comparator<Map.Entry<String, Integer>>() {
public int compare(
Map.Entry<String, Integer> o1,
Map.Entry<String, Integer> o2) {
return o2.getValue() - o1.getValue();
}
});
// The list is now sorted.
System.out.println( String.format("Top 3:\n%s: %d\n%s: %d\n%s: %d",
listOfAnswers.get(0).getKey(), listOfAnswers.get(0).getValue(),
listOfAnswers.get(1).getKey(), listOfAnswers.get(1).getValue(),
listOfAnswers.get(2).getKey(), listOfAnswers.get(2).getValue()));

For a better way of doing the sort, I'd do it like this:
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Map.Entry;
public class Main {
/**
* @param args
*/
public static void main(String[] args) {
HashMap<String, Integer> counter = new HashMap<String, Integer>();
// [... Code to populate hashtable goes here ...]
//
// Extract the map as a list
List<Map.Entry<String, Integer>> entries = new ArrayList<Map.Entry<String, Integer>>(counter.entrySet());
// Sort the list of entries.
Collections.sort(entries, new Comparator<Map.Entry<String, Integer>>() {
@Override
public int compare(Entry<String, Integer> first, Entry<String, Integer> second) {
// This will give a *positive* value if first freq < second freq, zero if they're equal, negative if first > second.
// The result is a highest frequency first sort.
return second.getValue() - first.getValue();
}
});
// And display the results
for (Map.Entry<String, Integer> entry : entries.subList(0, Math.min(10, entries.size())))
System.out.println(String.format("%s: %d", entry.getKey(), entry.getValue()));
}
}
Edit explaining why this works
Your original algorithm looks like a variant of Selection Sort, which is an O(n^2) algorithm. Your variant does a lot of extra swapping too, so is quite slow.
Being O(n^2), if you multiply your problem size by 10, it will typically take 100 times longer to run. Sorting half a million elements needs to do 250 billion comparisons, many of which will lead to a swap.
The built-in sort algorithm in Collections#sort is a fast variant of Merge Sort, which runs in O(n log n) time. That means that every time you multiply the problem size by 10, it takes only a little more than 10 times as long. Sorting half a million elements needs only about 10 million comparisons.
This is why experienced developers will advise you to use library functions whenever possible. Writing your own sort algorithms can be great for learning, but it takes a lot of work to implement one as fast and flexible as what's in the library.
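The comparison counts quoted above can be checked with a quick calculation. This is only a rough sketch: the class and method names are made up for illustration, and the n log n figure uses log base 2 and ignores constant factors.

```java
final class SortCost {
    // Comparisons made by the nested-loop sort: every element is compared with every element.
    static long quadratic(long n) {
        return n * n;
    }

    // Rough comparison count for an O(n log n) sort, using log base 2.
    static double nLogN(long n) {
        return n * (Math.log(n) / Math.log(2));
    }

    public static void main(String[] args) {
        long n = 500_000L;
        System.out.printf("n^2      = %d%n", quadratic(n));
        System.out.printf("n log2 n = %.0f%n", nLogN(n));
        // How much more work the n log n sort does when the input grows tenfold.
        System.out.printf("growth for 10x input = %.1f%n", nLogN(n) / nLogN(n / 10));
    }
}
```

For half a million elements the first figure is 250 billion and the second is a little under 10 million, which is the gap described above.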

create an inner class Word that implements Comparable
override public int compareTo(Word w) to make it use occurrences
create an array of words of the size of your HashMap
fill the array iterating through the HashMap
call Arrays.sort on the array
Alternatively, since you only need the top 10, you can just iterate through your Words and maintain a top 10 list as you go along.
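The "maintain a top 10 list as you go" idea can be sketched with a bounded min-heap. This is one possible reading of that suggestion, not the answerer's own code: TopWords and topN are hypothetical names, and entries with equal counts come back in no particular order.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

final class TopWords {
    // Returns the n entries with the highest counts, highest first.
    static List<Map.Entry<String, Integer>> topN(Map<String, Integer> counts, int n) {
        if (n <= 0) {
            return new ArrayList<>();
        }
        Comparator<Map.Entry<String, Integer>> byCount = Map.Entry.comparingByValue();
        // Min-heap: the head is always the smallest count currently kept.
        PriorityQueue<Map.Entry<String, Integer>> heap = new PriorityQueue<>(byCount);
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            heap.offer(entry);
            if (heap.size() > n) {
                heap.poll(); // drop the smallest so at most n entries remain
            }
        }
        List<Map.Entry<String, Integer>> result = new ArrayList<>(heap);
        result.sort(byCount.reversed());
        return result;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new HashMap<>();
        counts.put("the", 50);
        counts.put("a", 30);
        counts.put("map", 12);
        counts.put("java", 7);
        for (Map.Entry<String, Integer> entry : topN(counts, 2)) {
            System.out.println(entry.getKey() + ": " + entry.getValue());
        }
    }
}
```

Because the heap never holds more than n entries, the pass over the map costs O(m log n) for m words, which avoids sorting all half a million entries when only ten are needed.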

Related

Map sorted on size of value collection [duplicate]

This question already has answers here:
Sort a Map<Key, Value> by values
I'm trying to have a sorted map Map<Integer,Set<Integer>> which keeps elements sorted based on the size() of the value set.
In practice this is a map of a node to the other nodes connected to that node. I want to quickly (O(logn)) access the node with the most edges without having to sort every time.
For example the order should be:
3 => {1,2,4,5}
12 => {1,2,3}
14 => {3,2,3}
65 => {3,8}
6 => {2}
2 => {5}
Since TreeMap won't do it since I can't sort based on values, I probably need to roll something custom.
EDIT: The size of the Set may indeed change which may over-complicate things even more
What would be the simplest way to achieve this?
Here's a short example of how to use two maps for this. One map is sorted by Set::size, and the other is just a normal Map with an integer index. To use this, you have to keep the same key/value pairs in both maps. Note that because the size map's comparator looks only at Set::size, two sets of the same size count as the same key, so the second put replaces the first.
I'm not sure if I'd recommend trying to make a single Map out of this. It's got two lookups, by index and by size, so it doesn't really work like a regular map. It will depend on your use model.
package quicktest;
import static java.util.Comparator.comparing;
import java.util.HashSet;
import java.util.Set;
import java.util.TreeMap;
public class TreeMapTest
{
public static void main(String[] args) {
TreeMap<Integer,Set<Integer>> index = new TreeMap<>();
TreeMap<Set<Integer>,Integer> size = new TreeMap<>( comparing( Set::size ) );
for( int i = 0; i < 5; i++ ) {
Set<Integer> set = new HashSet<>();
for( int val = 0; val <= i; val++ ) {
set.add( val );
}
index.put( i, set );
size.put( set, i );
}
System.out.println( size.lastEntry() ); // largest set size
System.out.println( index.get( 2 ) ); // random index
}
}
What about this?
public class MapAndPriority {
Map<Integer, Set<Integer>> sets = new HashMap<Integer, Set<Integer>>();
PriorityQueue<Set<Integer>> byLength = new PriorityQueue<Set<Integer>>(1, new Comparator<Set<Integer>>() {
@Override
public int compare(Set<Integer> o1, Set<Integer> o2) {
// Compare in the reverse order!
return o2.size() - o1.size();
}
});
public void add(int i, Set<Integer> set) {
sets.put(i, set);
byLength.offer(set); // or Add, depending on the behavior you want
}
public Set<Integer> get(int i) {
return sets.get(i);
}
public Set<Integer> mostNodes() {
return byLength.peek();
}
public void remove(int i) {
// sets.remove will return the removed set, so that will be removed from byLength.
// Need to handle case when i does not exist as a key in sets
byLength.remove(sets.remove(i));
}
}
If I understand what you want, then this will:
Add a new set in O(log(n))
Do a regular map get()
Get the largest set (mostNodes()) in O(1), since peek() only reads the head of the queue
What I did was place all the sets in a priority queue (alongside the map) and give the priority queue a comparator that compares by size in reverse, so that a larger set counts as "smaller". That way, when you call peek(), it returns the 'minimum' value in the priority queue, which, due to our comparator, is the largest set.
I didn't deal with all kinds of edge cases (like removing when empty). Note also that a PriorityQueue does not reorder an element whose size changes after it has been added, and that remove(Object) takes linear time.
You can take a look at the PriorityQueue documentation for more details, including the complexity of each operation.
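Since the question's edit says the sets may change size, here is a minimal sketch of a variant that stays correctly ordered under updates. It swaps the priority queue for a TreeSet of node ids and takes each node out of the TreeSet before its set is modified, then puts it back. DegreeIndex and its method names are hypothetical, and ties on size are broken by node id.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class DegreeIndex {
    private final Map<Integer, Set<Integer>> neighbours = new HashMap<>();

    // Node ids ordered by the size of their neighbour set, then by id to break ties.
    private final TreeSet<Integer> byDegree = new TreeSet<>((a, b) -> {
        int bySize = Integer.compare(neighbours.get(a).size(), neighbours.get(b).size());
        return bySize != 0 ? bySize : Integer.compare(a, b);
    });

    void addEdge(int from, int to) {
        Set<Integer> set = neighbours.get(from);
        if (set == null) {
            set = new HashSet<>();
            neighbours.put(from, set);
        } else {
            byDegree.remove(from); // remove while the old size is still in effect
        }
        set.add(to);
        byDegree.add(from); // re-insert under the new size
    }

    void removeEdge(int from, int to) {
        Set<Integer> set = neighbours.get(from);
        if (set == null) {
            return; // unknown node: nothing to do
        }
        byDegree.remove(from);
        set.remove(to);
        byDegree.add(from);
    }

    // Returns the node with the most edges, or null if there are no nodes.
    Integer nodeWithMostEdges() {
        return byDegree.isEmpty() ? null : byDegree.last();
    }

    Set<Integer> neighboursOf(int node) {
        return neighbours.get(node);
    }
}
```

Every update and the lookup of the largest node are O(log n) here. The price is that all changes to a neighbour set have to go through addEdge and removeEdge; modifying a set obtained from neighboursOf directly would leave the TreeSet out of order.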

having problems with arraylist arrayList<int[]>

Now this is the question I am trying to answer:
Write a method which takes a sparse array as an argument and returns
a new equivalent dense array. The dense array only needs to be large enough to fit all of the values. For example, the resulting dense array only needs to hold 90 values if the last element in the sparse array is at index 89.
Dense array: [3,8,4,7,9,0,5,0] (the numbers are generated randomly).
The sparse array is an ArrayList of arrays: [[0,3],[1,8],[2,4],[3,7],[4,9],[6,5]]
So in the sparse array, if the number generated is nonzero, the value and its index are stored in an array of size 2, but if the number generated is 0, nothing is stored.
When every element (array) in your collection has a fixed size, your solution is OK and it is fast.
But when your elements do not have a fixed size, such as [[1,2,3],[4,5],[6],[7,8,9,10,11]], you can iterate through each element:
for(int[] e : sparseArr)
{
for(int number : e)
{
tree.add(number);
}
}
This works no matter how many elements sparseArr contains or how long each element is.
To sort your elements, I recommend using a TreeSet<E>: elements added to the tree are kept sorted automatically.
So if you just want to store 2 Integers paired together I recommend going with HashMaps. In your case you would use:
HashMap<Integer, Integer> map = new HashMap<Integer, Integer>();
HashMaps support .containsKey(key); as well as .containsValue(value);
If you want to check all entries you can transform the Map to an entrySet:
for(Entry<Integer, Integer> e : map.entrySet()) {
int one = e.getKey();
int two = e.getValue();
}
Unless you want to do something more special than just storing 2 paired Integers I really can recommend doing it this way!
The method you're after should do something like this
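public int[] sparseToDense (ArrayList<int[]> sparse) {
    // An empty sparse list has no values, so the dense array is empty too.
    if (sparse.isEmpty()) {
        return new int[0];
    }
    int i = 0;
    // The last pair holds the highest index, so the dense array needs that index + 1 slots.
    int[] dense = new int[sparse.get(sparse.size()-1)[0] + 1];
    for (int[] sp : sparse) {
        // Fill the gap before this pair's index with zeros.
        while (sp[0] != i) {
            dense[i++] = 0;
        }
        dense[i++] = sp[1];
    }
    return dense;
}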
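The method above relies on the pairs being stored in ascending index order. A minimal sketch of a variant that drops that assumption is below: it first finds the highest index and then writes each value straight into its slot. SparseArrays and toDense are hypothetical names, and each pair is assumed to be {index, value} with a non-negative index.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class SparseArrays {
    // Each entry of sparse is {index, value}; indices missing from the list are zero.
    static int[] toDense(List<int[]> sparse) {
        int maxIndex = -1;
        for (int[] pair : sparse) {
            maxIndex = Math.max(maxIndex, pair[0]);
        }
        // New int arrays are zero-filled, so the gaps need no extra work.
        int[] dense = new int[maxIndex + 1];
        for (int[] pair : sparse) {
            dense[pair[0]] = pair[1];
        }
        return dense;
    }

    public static void main(String[] args) {
        List<int[]> sparse = new ArrayList<>();
        sparse.add(new int[] {6, 5}); // deliberately out of order
        sparse.add(new int[] {0, 3});
        sparse.add(new int[] {1, 8});
        System.out.println(Arrays.toString(toDense(sparse)));
    }
}
```

With the question's sparse list, whose highest index is 6, the result has 7 elements; the trailing zero of the original dense array is not recoverable because nothing was stored for it.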
Another way to do this, since you have Java 8, is to use streams. But if you're a beginner, I recommend trying it with for loops and arrays first; that will be better for your learning.
public static ArrayList<Integer> returnDense(ArrayList<int[]> sparse) {
return sparse.stream().flatMap(p -> IntStream.of(p).boxed())
.collect(Collectors.toCollection(ArrayList::new));
}
The same approach works if you decide to change int[] to Integer[]:
public ArrayList<Integer> returnDense(ArrayList<Integer[]> sparse) {
return sparse.stream().flatMap(p -> Arrays.asList(p).stream()).filter(Objects::nonNull)
.collect(Collectors.toCollection(ArrayList::new));
}
.filter(Objects::nonNull) makes sure the result contains no null values; if you know there won't be any, it isn't necessary.

Counting occurrences of words in an array

I've been working on something which takes a stream of characters, forms words, makes an array of the words, then creates a vector which contains each unique word and the number of times it occurs (basically a word counter).
Anyway I've not used Java in a long time, or much programming to be honest and I'm not happy with how this currently looks. The part I have which makes the vector looks ugly to me and I wanted to know if I could make it less messy.
int counter = 1;
Vector<Pair<String, Integer>> finalList = new Vector<Pair<String, Integer>>();
Pair<String, Integer> wordAndCount = new Pair<String, Integer>(wordList.get(1), counter); // wordList contains " " as first word, starting at wordList.get(1) skips it.
for(int i= 1; i<wordList.size();i++){
if(wordAndCount.getLeft().equals(wordList.get(i))){
wordAndCount = new Pair<String, Integer>(wordList.get(i), counter++);
}
else if(!wordAndCount.getLeft().equals(wordList.get(i))){
finalList.add(wordAndCount);
wordAndCount = new Pair<String, Integer>(wordList.get(i), counter=1);
}
}
finalList.add(wordAndCount); //UGLY!!
As a secondary question, this gives me a vector with all the words in alphabetical order (as in the array). I want to have it sorted by occurrence, then alphabetically within that.
Would the best option be:
Iterate down the vector, testing each occurrence int with the one above, using Collections.swap() if it was higher, then checking the next one above (as its now moved up 1) and so on until it's no longer larger than anything above it. Any occurrence of 1 could be skipped.
Iterate down the vector again, testing each element against the first element of the vector and then iterating downwards until the number of occurrences is lower and inserting it above that element. All occurrences of 1 would once again be skipped.
The first method would do more in terms of iterating over the elements, but the second one requires you to add and remove components of the vector (I think?), so I don't know which is more efficient, or whether it's worth considering.
Why not use a Map to solve your problem?
String[] words // your incoming array of words.
Map<String, Integer> wordMap = new HashMap<String, Integer>();
for(String word : words) {
if(!wordMap.containsKey(word))
wordMap.put(word, 1);
else
wordMap.put(word, wordMap.get(word) + 1);
}
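On Java 8 or later, the containsKey/put pair above can be collapsed into a single Map.merge call. This is a small sketch of the same counting loop, with WordCounter and count as made-up names:

```java
import java.util.HashMap;
import java.util.Map;

final class WordCounter {
    // Counts how many times each word occurs in the array.
    static Map<String, Integer> count(String[] words) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : words) {
            // Stores 1 for a new word, otherwise adds 1 to the existing count.
            counts.merge(word, 1, Integer::sum);
        }
        return counts;
    }
}
```

The behaviour is the same as the explicit if/else version; merge just removes the second lookup.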
Sorting can be done using Java's sorted collections:
SortedMap<Integer, SortedSet<String>> sortedMap = new TreeMap<Integer, SortedSet<String>>();
for(Entry<String, Integer> entry : wordMap.entrySet()) {
if(!sortedMap.containsKey(entry.getValue()))
sortedMap.put(entry.getValue(), new TreeSet<String>());
sortedMap.get(entry.getValue()).add(entry.getKey());
}
Nowadays you should leave the sorting to the language's libraries. They have been proven correct over the years.
Note that the code may use a lot of memory because of all the data structures involved, but that is what we pay for higher level programming (and memory is getting cheaper every second).
I didn't run the code to see that it works, but it does compile (copied it directly from eclipse)
re: sorting, one option is to write a custom Comparator which first examines the number of times each word appears, then (if equal) compares the words alphabetically.
private final class PairComparator implements Comparator<Pair<String, Integer>> {
public int compare(Pair<String, Integer> p1, Pair<String, Integer> p2) {
/* compare by Integer */
/* compare by String, if necessary */
/* return a negative number, a positive number, or 0 as appropriate */
}
}
You'd then sort finalList by calling Collections.sort(finalList, new PairComparator());
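Here is a filled-in sketch of such a comparator. Because the Pair class from the question is not shown, this version works on Map.Entry<String, Integer> instead, which is an assumption on my part; ByCountThenWord and sorted are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

final class ByCountThenWord implements Comparator<Map.Entry<String, Integer>> {
    @Override
    public int compare(Map.Entry<String, Integer> a, Map.Entry<String, Integer> b) {
        // Higher counts come first, so the arguments are swapped.
        int byCount = Integer.compare(b.getValue(), a.getValue());
        // Equal counts fall back to alphabetical order of the word.
        return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
    }

    // Returns the map's entries sorted by count (descending), then by word.
    static List<Map.Entry<String, Integer>> sorted(Map<String, Integer> counts) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort(new ByCountThenWord());
        return entries;
    }
}
```

Integer.compare is used rather than subtraction so the comparison cannot overflow, whatever the counts are.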
How about using Google's Guava library?
Multiset<String> multiset = HashMultiset.create();
for (String word : words) {
multiset.add(word);
}
int countFoo = multiset.count("foo");
From their javadocs:
A collection that supports order-independent equality, like Set, but may have duplicate elements. A multiset is also sometimes called a bag.
Simple enough?

Java HashTable: What is the most elegant way to clone hashtable on defined interval on inverse order?

What is the most elegant way to copy keys and values from one hashtable to another between start and end keys in inverse order? For example original hashtable is:
[<1,"object1">; <2, "object2">; <4,"object3">; <5,"object4">;<7,"object5">;<8,"object6">]
after calling function getPartListOfNews(2,4) it should return hashtable like this:
[<7,"object5">;<5,"object4">;<4,"object3">]
I have written code to do it, shown below, but I don't think it is the best way to do what I described. Are there any better solutions? How can I simplify this code?
public Hashtable<Integer, News> getPartListOfNews(int start, int end){
Hashtable <Integer, News> tempNewsList = new Hashtable <Integer, News>();
int total_to_get = end-start;
int list_size = newsList.size();
Object[] key_array = new Object[list_size];
if(list_size < total_to_get){
return newsList;
}
else{
Enumeration e = newsList.keys();
int index=0;
while(e.hasMoreElements()){
key_array[index] = e.nextElement();
index++;
}
for (int i=end; i>start; i--){
tempNewsList.put((Integer)key_array[i], newsList.get(key_array[i]));
}
return tempNewsList;
}
}
Update:
public Hashtable<Integer, News> newsList = new Hashtable<Integer, News>();
Thanks.
First, you need to use a LinkedHashMap in your newsList attribute, to preserve insertion order. Also, it's better if you declare attributes and return values of methods using the Map interface instead of the concrete class used, in this way you can easily change the implementation, like this:
private Map<Integer, News> newsList = new LinkedHashMap<Integer, News>();
With the above in mind, here's my shot at solving your problem:
public Map<Integer, News> getPartListOfNews(int start, int end) {
    // first, get the range of keys from the original map (in insertion order)
    List<Integer> keys = new ArrayList<Integer>(newsList.keySet());
    // subList's upper bound is exclusive, so use end + 1 to include position end
    List<Integer> subkeys = keys.subList(start, end + 1);
    // now add them in the required (reverse) order
    Map<Integer, News> tempNewsList = new LinkedHashMap<Integer, News>();
    // start the iterator at the end of the list so that previous() walks backwards
    ListIterator<Integer> iter = subkeys.listIterator(subkeys.size());
    while (iter.hasPrevious()) {
        Integer key = iter.previous();
        tempNewsList.put(key, newsList.get(key));
    }
    return tempNewsList;
}
First, your code does not have any effect. Hash table "breaks" the order. The order of elements in hash table depends on the particular hash implementation.
The JDK has two main kinds of Map to consider here: HashMap and SortedMap (typically we use its implementation TreeMap). BTW, do not use Hashtable: it is an old, synchronized, and almost obsolete implementation.
When you are using HashMap (or Hashtable), the order of keys is unpredictable: it depends on the implementation of the hashCode() method of the class you are using as the keys of your map. If you are using TreeMap, you can use a Comparator to change this logic.
If you wish your keys to be extracted in the same order you put them in, use LinkedHashMap.
I think a Hashtable is not ordered. If you use an ordered data structure (such as LinkedHashMap), you could sort it (with Java's built-in methods) and make a sublist. This should be 2 lines of code and very efficient.
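As a sketch of the sorted-map route, the following copies the entries at sorted positions start to end (inclusive, as in the question's getPartListOfNews(2,4) example) into a LinkedHashMap, highest key first. RangeCopy and reversedSlice are hypothetical names, and the value type is left generic because the News class is not shown.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

final class RangeCopy {
    // Copies the entries at sorted positions start..end (inclusive), highest key first.
    static <V> Map<Integer, V> reversedSlice(SortedMap<Integer, V> source, int start, int end) {
        List<Integer> keys = new ArrayList<>(source.keySet()); // ascending key order
        int from = Math.max(0, start);
        int to = Math.min(keys.size(), end + 1);
        Map<Integer, V> result = new LinkedHashMap<>(); // remembers insertion order
        for (int i = to - 1; i >= from; i--) {
            Integer key = keys.get(i);
            result.put(key, source.get(key));
        }
        return result;
    }

    public static void main(String[] args) {
        SortedMap<Integer, String> news = new TreeMap<>();
        int[] ids = {1, 2, 4, 5, 7, 8};
        for (int i = 0; i < ids.length; i++) {
            news.put(ids[i], "object" + (i + 1));
        }
        System.out.println(reversedSlice(news, 2, 4));
    }
}
```

Out-of-range positions are clamped rather than rejected here, which is a design choice for the sketch; throwing an exception would be just as reasonable.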

How to find multiples of the same integer in an arraylist?

My problem is as follows. I have an ArrayList of integers. The ArrayList contains 5 ints, e.g. [5,5,3,3,9] or perhaps [2,2,2,2,7]. Many of the ArrayLists have duplicate values and I'm unsure how to count how many of each value exist.
The problem is how to find the duplicate values in the ArrayList and count how many of that particular duplicate there are. In the first example, [5,5,3,3,9], there are 2 5's and 2 3's. The second example, [2,2,2,2,7], would be only 4 2's. The resulting information I wish to find is whether there are any duplicates, how many of them there are, and which specific integer has been duplicated.
I'm not too sure how to do this in Java.
Any help would be much appreciated. Thanks.
To me, the most straightforward answer, would be using the Collections.frequency method. Something along the lines of this:
// Example ArrayList with Integer values
ArrayList<Integer> intList = new ArrayList<Integer>();
intList.add(2);
intList.add(2);
intList.add(2);
intList.add(2);
intList.add(7);
Set<Integer> noDupes = new HashSet<Integer>();
noDupes.addAll(intList); // Remove duplicates
for (Integer i : noDupes) {
int occurrences = Collections.frequency(intList, i);
System.out.println(i + " occurs " + occurrences + " times.");
}
If you want to, you could map each Integer with its number of occurrences:
Map<Integer, Integer> map = new HashMap<Integer, Integer>();
for (Integer i : noDupes) {
map.put(i, Collections.frequency(intList, i));
}
Two algorithms spring to mind.
Sort it (Collections.sort). Then iterate through easily finding dupes.
Iterate through keeping count in a Map<Integer,Integer> (or Map<Integer,AtomicInteger> for a mutable count). A bit ugly this way.
Either way, coding it should be an instructive exercise. I suggest doing both, and comparing.
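For comparison, here is a minimal sketch of the first approach (sort, then scan for runs of equal values). SortedDupes and countDupes are hypothetical names; the input list is copied so the caller's list is left untouched.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class SortedDupes {
    // Maps each value that occurs more than once to its number of occurrences.
    static Map<Integer, Integer> countDupes(List<Integer> values) {
        List<Integer> sorted = new ArrayList<>(values);
        Collections.sort(sorted);
        Map<Integer, Integer> dupes = new LinkedHashMap<>();
        int i = 0;
        while (i < sorted.size()) {
            // Advance j to the end of the run of values equal to sorted.get(i).
            int j = i;
            while (j < sorted.size() && sorted.get(j).equals(sorted.get(i))) {
                j++;
            }
            if (j - i > 1) {
                dupes.put(sorted.get(i), j - i);
            }
            i = j;
        }
        return dupes;
    }

    public static void main(String[] args) {
        System.out.println(countDupes(Arrays.asList(5, 5, 3, 3, 9)));
        System.out.println(countDupes(Arrays.asList(2, 2, 2, 2, 7)));
    }
}
```

After sorting, equal values sit next to each other, so a single pass is enough to measure each run.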
Here is a concrete implementation, with test, of what I described in comments to @Tom's answer:
package playground.tests;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;
import junit.framework.TestCase;
public class DupeCounterTest extends TestCase {
public void testCountDupes() throws Exception {
int[] array = new int[] { 5, 5, 3, 3, 9 };
assertEquals("{3=2, 5=2}", countDupes(array).toString());
}
private Map<Integer, AtomicInteger> countDupes(int[] array) {
Map<Integer, AtomicInteger> map = new HashMap<Integer, AtomicInteger>();
// first create an entry in the map for every value in the array
for (int i : array)
map.put(i, new AtomicInteger());
// now count all occurrences
for (int i : array)
map.get(i).addAndGet(1);
// now get rid of those where no duplicate exists
HashSet<Integer> discards = new HashSet<Integer>();
for (Integer i : map.keySet())
if (map.get(i).get() == 1)
discards.add(i);
for (Integer i : discards)
map.remove(i);
return map;
}
}
Use a HashMap collection in addition to the array list, where
the HashMap key is the unique array int value and
the HashMap value for that key is the count of each value encountered.
Walk your array list collecting these values into the HashMap, adding a new item when a key does not exist yet and incrementing by 1 the values of keys that do already exist. Then iterate over the HashMap and print out any keys where the value is > 1.
You can go through the List and put them in a Map with the count. Then it is easy to figure out which one is duplicated.
For a cleaner abstraction of what you're doing, you could use the Multiset data structure from guava/google-collections. You may even find you'd rather use it than a List, depending on what you're doing with it (if you don't need the deterministic ordering of a list). You'd use it like this:
Multiset<Integer> multiset = HashMultiset.create(list);
int count = multiset.count(3); // gets the number of 3s that were in the list
In terms of what the above is doing under the covers, it's almost exactly equivalent to the suggestion of building a Map<Integer,AtomicInteger> based on your list.
