I have a list of files. I would like to scan through it and keep a count of the number of files with the same size. The issue is the file size, which is a long: as we know, HashMap only takes objects, not primitives. So I wrapped it with new Long(filesize) and put that into the HashMap. Instead of getting pairs of (filesize, count), I got a list of (filesize, 1) entries, due to the fact that each Long obj is unique.
How do I go about building this accumulator?
Is there any solution for Java 1.4.2?
You simply do it this way:
Map<Long, Integer> count = new HashMap<Long, Integer>();
for (File file : files) {
    long size = file.length(); // the file's size in bytes; getTotalSpace() would return the size of the whole partition
    Integer n = count.get(size);
    if (n == null) {
        count.put(size, 1);
    } else {
        count.put(size, n + 1);
    }
}
There is some auto-boxing and unboxing going on here.
Instead of using new Long(size), you should use Long.valueOf(size). That returns a cached Long instance for small values (the cache is guaranteed for -128 to 127), which avoids an allocation and can boost performance a little (not that it will be visible unless you do millions of these boxing operations).
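For example (note that equality outside the cached range must still be checked with equals(), which is exactly what HashMap does internally):

Long a = Long.valueOf(100);
Long b = Long.valueOf(100);
System.out.println(a == b);      // true: small values are served from the cache

Long c = Long.valueOf(1000);
Long d = Long.valueOf(1000);
System.out.println(c == d);      // false (typically): outside the cached range
System.out.println(c.equals(d)); // true: value equality is what HashMap relies on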
P.S. This only works on Java 1.5 or above.
You can use Trove to store (long, int) pairs with its TLongIntHashMap, which avoids boxing entirely.
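A minimal sketch, assuming Trove 3 (where the class lives in gnu.trove.map.hash):

import gnu.trove.map.hash.TLongIntHashMap;

TLongIntHashMap sizeCounts = new TLongIntHashMap();
for (File file : files) {
    // adds 1 to the existing count, or stores 1 if this size has not been seen yet
    sizeCounts.adjustOrPutValue(file.length(), 1, 1);
}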
Or you could use AtomicInteger as a mutable integer:
Map<Long, AtomicInteger> count = new HashMap<Long, AtomicInteger>();
for (File file : files) {
    // file.length() returns the actual file size; getTotalSpace() returns the size
    // of the partition the file lives on, not the size of the file.
    long size = file.length();
    AtomicInteger n = count.get(size);
    if (n == null) {
        count.put(size, new AtomicInteger(1));
    } else {
        n.getAndIncrement();
    }
}
Expanding on what cletus wrote.
His solution is fine, except it only stores each file size you come across and the number of files that have that size. If you ever want to know which files those are, this data structure will be useless to you, so I don't think cletus's solution is quite complete. Instead I would do:
Map<Long, Collection<File>> count = new HashMap<Long, Collection<File>>();
for (File file : files) {
    long size = file.length();
    Collection<File> c = count.get(size);
    if (c == null) {
        c = new ArrayList<File>(); // or whatever collection you feel comfortable with
        count.put(size, c);
    }
    c.add(file);
}
Then you can get the number of files with c.size(), and you can easily iterate through all the files of that size without having to run this procedure again.
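For example, to print the count for each size afterwards:

for (Map.Entry<Long, Collection<File>> e : count.entrySet()) {
    System.out.println(e.getKey() + " bytes: " + e.getValue().size() + " file(s)");
}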
I think there's more to this, and we'll need more details from you. I'm assuming you know there's definitely more than one file of a given size; otherwise, I'd first check that that's actually the case. For all you know, you simply have a lot of files with unique sizes.
You mentioned:
...due to the fact that each Long obj is unique.
I don't think this is the problem. While this may be true depending on how you are instantiating the Longs, it should not prevent HashMap from behaving the way you want. As long as two key objects return the same hashCode() value and their equals() methods say they are equal, your HashMap will not create another entry for them. In fact, it should not be possible for you to see "a list of (filesize, 1)" with repeated filesize values (unless you wrote your own Long and failed to implement hashCode()/equals() correctly).
That said, cletus's code should work if you're using Java 5 or higher. If you're using Java 1.4 or below, you'll need to either do your own boxing/unboxing manually, or look into Apache Commons Collections. Here's the pre-Java 5 version of cletus's example:
Map count = new HashMap();
for (Iterator filesIter = files.iterator(); filesIter.hasNext();) {
    File file = (File) filesIter.next();
    Long size = new Long(file.length()); // no auto-boxing before Java 5
    Integer n = (Integer) count.get(size); // pre-generics get() returns Object, so a cast is needed
    if (n == null) {
        count.put(size, new Integer(1));
    } else {
        count.put(size, new Integer(n.intValue() + 1));
    }
}
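As an aside, since Apache Commons Collections was mentioned: its Bag interface keeps element counts for you. A minimal pre-generics sketch, assuming commons-collections 3.x (Bag lives in org.apache.commons.collections, HashBag in org.apache.commons.collections.bag; someSize is a placeholder for whatever size you want to look up):

Bag sizeCounts = new HashBag();
for (Iterator filesIter = files.iterator(); filesIter.hasNext();) {
    File file = (File) filesIter.next();
    sizeCounts.add(new Long(file.length())); // increments the count for this size
}
int n = sizeCounts.getCount(new Long(someSize)); // how many files had that size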
Related
If I have an article or a novel in English and I want to count how many times each word appears, what is the fastest algorithm, written in Java?
Some people say you can use a Map<String, Integer> for this, but I was wondering: how do I know what the key words are? Every article has different words, so how do you know the keys in advance and then add one to each word's count?
Here is another way to do it with the things that appeared in Java 8:
private void countWords(final Path file) throws IOException {
    // requires: import static java.util.stream.Collectors.counting;
    Arrays.stream(new String(Files.readAllBytes(file), StandardCharsets.UTF_8).split("\\W+"))
            .collect(Collectors.groupingBy(Function.<String>identity(), TreeMap::new, counting()))
            .entrySet()
            .forEach(System.out::println);
}
So what is it doing?
It reads the text file completely into memory, into a byte array to be precise: Files.readAllBytes(file). This method appeared in Java 7 and loads a file very fast, at the price that the whole file will be in memory at once, costing a lot of memory. For speed, however, this is a good approach.
The byte[] is converted to a String: new String(Files.readAllBytes(file), StandardCharsets.UTF_8), assuming the file is UTF-8 encoded. Change that as you need. The price is a full copy of the already huge piece of data in memory. It may be faster to work with a memory-mapped file instead.
The string is split at non-word characters: ...split("\\W+"), which creates an array of strings containing all your words.
We create a stream from that array: Arrays.stream(...). This by itself does not do very much, but it lets us do a lot of fun things with the stream.
We group all the words together: Collectors.groupingBy(Function.<String>identity(), TreeMap::new, counting()). This means:
We want to group the words by the words themselves (identity()). We could also, for example, lowercase the string here first if grouping should be case-insensitive. The word will end up as the key in a map.
As the structure for storing the grouped values we want a TreeMap (TreeMap::new). TreeMaps are sorted by their keys, so we can easily output in alphabetical order at the end. If you do not need sorting, you could also use a HashMap here.
As the value for each group we want the number of occurrences of each word (counting()). In the background, that means that for each word we add to a group, we increase its counter by one.
From step 5 we are left with a Map that maps words to their counts. Now we just want to print them, so we access a collection of all the key/value pairs in this map (.entrySet()).
Finally, the actual printing: we say that each element should be passed to the println method: .forEach(System.out::println). And you are left with a nice list.
So how good is this answer? The upside is that it is very short and thus highly expressive. It also gets by with only a single system call hidden behind Files.readAllBytes (or at least a fixed number; I am not sure whether this really works in a single system call), and system calls can be a bottleneck. For example, if you read a file from a stream, each call to read may trigger a system call. This is significantly reduced by using a BufferedReader, which, as the name suggests, buffers; but readAllBytes should still be fastest. The price is that it consumes huge amounts of memory. However, Wikipedia claims that a typical English book has 500 pages with 2,000 characters per page, which is roughly 1 megabyte, and that should not be a problem in terms of memory consumption even on a smartphone, a Raspberry Pi, or a really, really old computer.
This solution does involve some optimizations that were not possible prior to Java 8. For example, the idiom map.put(word, map.get(word) + 1) requires the word to be looked up twice in the map, which is an unnecessary waste.
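With Java 8's Map.merge, the same update needs only a single lookup; for example:

map.merge(word, 1, Integer::sum); // inserts 1 if the word is absent, otherwise adds 1 to the old count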
But a simple loop might also be easier for the compiler to optimize and might save a number of method calls. So I wanted to know, and put this to a test. I generated a file using:
[ -f /tmp/random.txt ] && rm /tmp/random.txt; for i in {1..15}; do head -n 10000 /usr/share/dict/american-english >> /tmp/random.txt; done; perl -MList::Util -e 'print List::Util::shuffle <>' /tmp/random.txt > /tmp/random.tmp; mv /tmp/random.tmp /tmp/random.txt
This gives me a file of about 1.3 MB, not that atypical for a book, with most words repeated 15 times but in random order, to avoid this ending up as a branch-prediction test. Then I ran the following tests:
public class WordCountTest {

    @Test(dataProvider = "provide_description_testMethod")
    public void test(String description, TestMethod testMethod) throws Exception {
        long start = System.currentTimeMillis();
        for (int i = 0; i < 100_000; i++) {
            testMethod.run();
        }
        System.out.println(description + " took " + (System.currentTimeMillis() - start) / 1000d + "s");
    }

    @DataProvider
    public Object[][] provide_description_testMethod() {
        Path path = Paths.get("/tmp/random.txt");
        return new Object[][]{
                {"classic", (TestMethod) () -> countWordsClassic(path)},
                {"mixed", (TestMethod) () -> countWordsMixed(path)},
                {"mixed2", (TestMethod) () -> countWordsMixed2(path)},
                {"stream", (TestMethod) () -> countWordsStream(path)},
                {"stream2", (TestMethod) () -> countWordsStream2(path)},
        };
    }

    private void countWordsClassic(final Path path) throws IOException {
        final Map<String, Integer> wordCounts = new HashMap<>();
        for (String word : new String(readAllBytes(path), StandardCharsets.UTF_8).split("\\W+")) {
            Integer oldCount = wordCounts.get(word);
            if (oldCount == null) {
                wordCounts.put(word, 1);
            } else {
                wordCounts.put(word, oldCount + 1);
            }
        }
    }

    private void countWordsMixed(final Path path) throws IOException {
        final Map<String, Integer> wordCounts = new HashMap<>();
        for (String word : new String(readAllBytes(path), StandardCharsets.UTF_8).split("\\W+")) {
            // merge passes (oldValue, newValue) to the function, so add them together
            wordCounts.merge(word, 1, (oldCount, one) -> oldCount + one);
        }
    }

    private void countWordsMixed2(final Path path) throws IOException {
        final Map<String, Integer> wordCounts = new HashMap<>();
        Pattern.compile("\\W+")
                .splitAsStream(new String(readAllBytes(path), StandardCharsets.UTF_8))
                .forEach(word -> wordCounts.merge(word, 1, (oldCount, one) -> oldCount + one));
    }

    private void countWordsStream2(final Path tmpFile) throws IOException {
        Pattern.compile("\\W+").splitAsStream(new String(readAllBytes(tmpFile), StandardCharsets.UTF_8))
                .collect(Collectors.groupingBy(Function.<String>identity(), HashMap::new, counting()));
    }

    private void countWordsStream(final Path tmpFile) throws IOException {
        Arrays.stream(new String(readAllBytes(tmpFile), StandardCharsets.UTF_8).split("\\W+"))
                .collect(Collectors.groupingBy(Function.<String>identity(), HashMap::new, counting()));
    }

    interface TestMethod {
        void run() throws Exception;
    }
}
The results were:
type      length   diff
classic   4665s    +9%
mixed     4273s    +0%
mixed2    4833s    +13%
stream    4868s    +14%
stream2   5070s    +19%
Note that I also tested with TreeMaps earlier, but found that HashMaps were much faster, even when I sorted the output afterwards. I also changed the tests above after Tagir Valeev told me in the comments about the Pattern.splitAsStream() method. Since I got strongly varying results, I let the tests run for quite a while, as you can see from the lengths in seconds above, to get meaningful results.
How I judge the results:
The "mixed" approach which does not use streams at all, but uses the "merge" method with callback introduced in Java 8 does improve the performance. This is something I expected because the classic get/put appraoch requires the key to be looked up twice in the HashMap and this is not required anymore with the "merge"-approach.
To my surprise, the Pattern.splitAsStream() approach is actually slower than Arrays.stream(...split()). I had a look at the source code of both implementations, and I noticed that the split() call saves its results in an ArrayList which starts with a size of zero and is enlarged as needed. This requires many copy operations, and in the end another copy operation to copy the ArrayList into an array. But splitAsStream actually creates an iterator which I thought could be queried as needed, avoiding these copy operations completely. I did not look through all the source that converts the iterator to a stream object, but it seems to be slow, and I don't know why. In the end it could theoretically have to do with CPU memory caches: if exactly the same code is executed over and over again, it is more likely to stay in the cache than when running through large function chains, but that is very wild speculation on my side. It may also be something completely different. However, splitAsStream MIGHT have a better memory footprint; maybe it does not, as I did not profile that.
The stream approach in general is pretty slow. This is not totally unexpected, because quite a number of method invocations take place, including, for example, something as pointless as Function.identity. However, I did not expect a difference of this magnitude.
As an interesting side note, I find the mixed approach, which was fastest, quite easy to read and understand. The call to "merge" does not have the most obvious effect to me, but if you know what this method is doing, it seems the most readable, while the groupingBy command is more difficult for me to understand. I guess one might be tempted to say that groupingBy is so special and highly optimized that it makes sense to use it for performance, but as demonstrated here, that is not the case.
Map<String, Integer> countByWords = new HashMap<String, Integer>();
Scanner s = new Scanner(new File("your_file_path"));
while (s.hasNext()) {
    String next = s.next();
    Integer count = countByWords.get(next);
    if (count != null) {
        countByWords.put(next, count + 1);
    } else {
        countByWords.put(next, 1);
    }
}
s.close();
this count "I'm" as only one word
General overview of steps:
Create a HashMap<String, Integer>
Read the file one word at a time. If the word doesn't exist in your HashMap, add it with a count of 1; if it does exist, increment its count by 1. Read until the end of the file.
This will result in a map of all your words and the count for each word.
If I were you, I would use one of the implementations of Map<String, Integer> (generics require the wrapper type Integer, not the primitive int), like a HashMap. Then, as you loop through each word, if it already exists, just increment its count by one; otherwise, add it to the map. At the end you can pull out all of the words, or query by a specific word to get its count.
If order is important to you, you could try a SortedMap<String, Integer> to be able to print them out in alphabetical order.
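A minimal sketch of that idea (TreeMap is the standard SortedMap implementation; 'words' here stands in for however you tokenize the text):

SortedMap<String, Integer> counts = new TreeMap<String, Integer>();
for (String word : words) {
    Integer old = counts.get(word);
    counts.put(word, old == null ? 1 : old + 1);
}
for (Map.Entry<String, Integer> e : counts.entrySet()) {
    System.out.println(e.getKey() + ": " + e.getValue()); // keys come out in alphabetical order
}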
Hope that helps!
It is actually the classic word-count algorithm.
Here is the solution:
public Map<String, Integer> wordCount(String[] strings) {
    Map<String, Integer> map = new HashMap<String, Integer>();
    for (String s : strings) {
        if (map.containsKey(s)) {
            int count = map.get(s);
            map.put(s, count + 1);
        } else {
            map.put(s, 1);
        }
    }
    return map;
}
Here is my solution:
Map<String, Integer> map = new HashMap<String, Integer>();
for (int i = 0; i < strings.length; i++) {
    int count = 0;
    for (int j = 0; j < strings.length; j++) {
        if (strings[i].equals(strings[j])) { // compare contents with equals(), not ==
            count++;
        }
    }
    map.put(strings[i], count);
}
return map; // note: this is O(n^2); the single-pass map solutions above are O(n)
So I have the following code that takes two arrays as input, applies some queries to match elements from Array1 with elements from Array2, and then returns the number of elements that are similar in the two ArrayLists.
Here is the code I use:
public static void get_ND_Matches() throws IOException {
    List<String> array1 = new ArrayList<String>(ClassesRetrieval.getDBpediaClasses());
    List<String> array2 = new ArrayList<String>(ClassesRetrieval.fileToArrayListYago());
    String maxLabel = "";
    HashMap<String, Integer> map = new HashMap<String, Integer>();
    int number;
    HashMap<String, ArrayList<String>> theMap = new HashMap<>();

    for (String yagoClass : array2) {
        theMap.put(yagoClass, getListTwo(yagoClass));
        System.out.println("Done for: " + yagoClass);
    }

    for (String dbClass : array1) {
        ArrayList<String> result = get_2D_Matches(dbClass);
        for (Map.Entry<String, ArrayList<String>> entry : theMap.entrySet()) {
            String yagoClass = entry.getKey();
            Set<String> intersectionSet = Sets.intersection(Sets.newHashSet(result), Sets.newHashSet(entry.getValue()));
            System.out.println(dbClass + " and " + yagoClass + " = " + intersectionSet.size());
            number = intersectionSet.size();
            map.put(yagoClass, number);
        }
        int maxValue = Collections.max(map.values());
        for (Entry<String, Integer> entry : map.entrySet()) {
            if (entry.getValue() == maxValue && maxValue != 0) {
                maxLabel = entry.getKey();
            }
            if (maxValue == 0) {
                maxLabel = "Nothing in yago";
            }
        }
        System.out.println("-------------------------------");
        System.out.println(dbClass + " from DBPEDIA Corresponds to " + maxLabel);
        System.out.println("-------------------------------");
    }
}
This code returns for example:
Actor from DBPEDIA Corresponds to Yago_Actor
Album from DBPEDIA Corresponds to Yago_Album
SomeClass from DBPEDIA Corresponds to nothing in Yago
Etc..
Behind the scenes, this code uses getDBpediaClasses and then applies the get_2D_Matches() method to get an ArrayList of results for each class. Each resulting ArrayList is then compared to another ArrayList generated by getListTwo() for each class from fileToArrayListYago().
Now, because of all the calculations made in the background (there are millions of elements in each array), this process takes hours to execute.
I would really like to use concurrency/multithreading to solve this issue. Could anyone show me how to do that?
It makes little sense to parallelize code that is not clean and optimized first. You may gain a factor of 4 on a typical 4-core CPU, or nothing at all, depending on whether you choose the part to be parallelized properly. Using a better algorithm may gain you much more.
It's possible that the bottleneck is get_2D_Matches, which you haven't published.
Computing the maximum directly instead of creating a throw-away HashMap<String,Integer> map could save quite some time, and so could moving Sets.newHashSet(result) out of the loop.
You should really reconsider variable naming. With names like map, theMap, and result (for something which is not the method's result), it's hard to find out what's going on.
If you really want to parallelize it, you need to split up your overlong method first. Then it's rather simple, since the processing of each dbClass can be done independently: just encapsulate it as a Callable and submit it to an ExecutorService, as sketched below.
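A rough sketch of that structure (matchOneDbClass is a hypothetical method holding the body of your per-dbClass loop):

ExecutorService pool = Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
List<Future<String>> futures = new ArrayList<Future<String>>();
for (final String dbClass : array1) {
    futures.add(pool.submit(new Callable<String>() {
        public String call() throws Exception {
            return matchOneDbClass(dbClass); // hypothetical: returns e.g. "Actor from DBPEDIA Corresponds to Yago_Actor"
        }
    }));
}
for (Future<String> f : futures) {
    System.out.println(f.get()); // get() blocks until the task's result is ready
}
pool.shutdown();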
However, I'd suggest cleaning up the code first, then submitting it to Code Review, and only then considering parallelizing it.
I'd like to do something with a map value for a given key only if the map contains that key. Naively I would write:
Map<String, String> myMap = ...;
if (myMap.containsKey(key)) {
    String value = myMap.get(key);
    // Do things with value
}
The code above looks easy to understand, but from a performance point of view, wouldn't the following code be better?
Map<String, String> myMap = ...;
String value = myMap.get(key);
if (value != null) {
    // Do things with value
}
In the second snippet I don't like the fact that value is declared with a wider scope.
How does the performance of given cases change with respect to the Map implementation?
Note: Let's assume that null values are not admitted in the map. I'm not talking about asymptotic complexity here, which is the same for both snippets
Map is an interface, so the implementing classes have quite a bit of freedom in how they implement each operation. (It's entirely possible to write a class that caches the last entry looked up, which could give constant-time access in get when the key is the same as the last one requested, making the two snippets practically equivalent apart from one extra comparison.)
For TreeMap and HashMap, for example, containsKey is essentially just a get operation (more specifically getEntry) with a check for null.
Thus, for these two containers, the first version should take roughly twice as long as the second (assuming you use the same type of Map in both cases).
Note that HashMap.get is O(1) (with a hash function well-suited to the data) and TreeMap.get is O(log n). So if you do any significant amount of work in the loop, and the Map doesn't contain on the order of millions of elements, the difference in performance is likely to be negligible.
However, note the disclaimer in the docs for Map.get:
If this map permits null values, then a return value of null does not necessarily indicate that the map contains no mapping for the key; it's also possible that the map explicitly maps the key to null. The containsKey operation may be used to distinguish these two cases.
To answer your question,
"How does the performance of given cases change with respect to the Map implementation?"
The performance difference is negligible.
To comment on your comment,
"In the second snippet I don't like the fact that value is declared with a wider scope."
Good, you shouldn't. You see, there are two ways to get null returned from a Map:
The key doesn't exist
OR
The key does exist, but its value is null (if the Map implementation allows null values, like HashMap).
So the two snippets could actually behave differently if the key exists with a null value!
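A tiny demonstration of the second case:

Map<String, String> m = new HashMap<String, String>();
m.put("key", null);
System.out.println(m.get("key"));         // prints null even though the key exists
System.out.println(m.containsKey("key")); // prints true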
EDIT
I wrote the following code to test out the performance of the two scenarios:
public class TestMapPerformance {

    static Map<String, String> myMap = new HashMap<String, String>();
    static int iterations = 7000000;

    // populate a map with seven million strings for keys
    static {
        for (int i = 0; i <= iterations; i++) {
            String tryIt = Integer.toString(i);
            myMap.put(tryIt, "hi");
        }
    }

    // run each scenario twice and print out the results
    public static void main(String[] args) {
        System.out.println("Key Exists: " + testMap_CheckIfKeyExists(iterations));
        System.out.println("Value Null: " + testMap_CheckIfValueIsNull(iterations));
        System.out.println("Key Exists: " + testMap_CheckIfKeyExists(iterations));
        System.out.println("Value Null: " + testMap_CheckIfValueIsNull(iterations));
    }

    // check if the key exists, then get its value
    public static long testMap_CheckIfKeyExists(int iterations) {
        Date date = new Date();
        for (int i = 0; i <= iterations; i++) {
            String key = Integer.toString(i);
            if (myMap.containsKey(key)) {
                String value = myMap.get(key);
                String newString = new String(value);
            }
        }
        return new Date().getTime() - date.getTime();
    }

    // get the key's value, then check if that value is null
    public static long testMap_CheckIfValueIsNull(int iterations) {
        Date date = new Date();
        for (int i = 0; i <= iterations; i++) {
            String key = Integer.toString(i);
            String value = myMap.get(key);
            if (value != null) {
                String newString = new String(value);
            }
        }
        return new Date().getTime() - date.getTime();
    }
}
I ran it and this was the result:
Key Exists: 9901
Value Null: 11472
Key Exists: 11578
Value Null: 9387
So, in conclusion, the difference in performance is negligible.
Obviously the second version is more performant: you look up the key in the map only once, while in the first version you look it up twice, computing the key's hash code and searching the hash buckets twice, assuming you are using a HashMap of course.
You could have a completely different implementation of the Map interface that handles this kind of code much better by remembering the map entry found in the last containsKey call; if the subsequent get uses the same key (compared with the == operator), it could immediately return the associated value from the remembered entry.
However, there is a danger in the second method: what if I put this in the map:
map.put("null", null);
then map.get("null") would return null and you would treat it as "null" is not mapped while map.contains("null") would return true !
We can use the old for loop with two variables: for (int i = 0, j = 0; i < 30; i++, j++).
Can we use the for-each loop (the enhanced for loop, for (Item item : items)) in Java with two variables? What's the syntax for that?
Unfortunately, Java supports only a rudimentary foreach loop, called the enhanced for loop. Other languages, especially FP ones like Scala, support a construct known as list comprehension (Scala calls it for comprehension) which allows nested iterations, as well as filtering of elements along the way.
No, you can't. It is syntactic sugar for using an Iterator. Refer here for a good answer on this issue.
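In other words, the compiler rewrites the enhanced for loop into roughly the following:

for (Iterator<Item> it = items.iterator(); it.hasNext(); ) {
    Item item = it.next();
    // loop body
}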
You need to have an object that contains both variables.
It can be shown on a Map object for example.
for (Map.Entry<String, String> e : map.entrySet()) {
    // you can use e.getKey() and e.getValue() here
}
The following should have the same (performance) effect that you are trying to achieve:
List<Item> aItems = new ArrayList<Item>(); // List is an interface; instantiate a concrete class
List<Item> bItems = new ArrayList<Item>();
...
Iterator<Item> aIterator = aItems.iterator();
Iterator<Item> bIterator = bItems.iterator();

while (aIterator.hasNext() && bIterator.hasNext()) {
    Item aItem = aIterator.next();
    Item bItem = bIterator.next();
}
The for-each loop assumes that there is only one collection of things; you can do something with each element per iteration. How would you want it to behave if you could iterate over two collections at once? What if they have different lengths?
Assuming that you have
Collection<T1> collection1;
Collection<T2> collection2;
You could write an iterable wrapper that iterates over both and returns some sort of merged result.
for (TwoThings<T1, T2> thing : new TwoCollectionWrapper<T1, T2>(collection1, collection2)) {
    // one of them could be null if the collections have different lengths
    T1 t1 = thing.getFirst();
    T2 t2 = thing.getSecond();
}
That's the closest I can think of, but I don't see much use for it. If both collections are meant to be iterated together, it would be simpler to create a Collection<TwoThings> in the first place.
Besides iterating in parallel, you could also want to iterate sequentially. There are implementations for that, e.g. Guava's Iterables.concat().
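For example, reusing the Item lists from the earlier snippet:

for (Item item : Iterables.concat(aItems, bItems)) {
    // visits every element of aItems, then every element of bItems
}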
The simple answer "No" is already given. But you could implement taking two iterators as argument, and returning Pairs of the elements coming from the two iterators. Pair being a class with two fields. You'd either have to implement that yourself, or it is probably existent in some apache commons or similar lib.
This new Iterator could then be used in the for-each loop, as sketched below.
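A minimal sketch of that idea (Pair and zip here are written from scratch and purely illustrative; Apache Commons Lang 3 offers a ready-made Pair in org.apache.commons.lang3.tuple):

final class Pair<A, B> {
    final A first;
    final B second;
    Pair(A first, B second) {
        this.first = first;
        this.second = second;
    }
}

static <A, B> Iterable<Pair<A, B>> zip(final Iterable<A> as, final Iterable<B> bs) {
    return new Iterable<Pair<A, B>>() {
        public Iterator<Pair<A, B>> iterator() {
            final Iterator<A> ia = as.iterator();
            final Iterator<B> ib = bs.iterator();
            return new Iterator<Pair<A, B>>() {
                // stops as soon as the shorter of the two collections runs out
                public boolean hasNext() {
                    return ia.hasNext() && ib.hasNext();
                }
                public Pair<A, B> next() {
                    return new Pair<A, B>(ia.next(), ib.next());
                }
                public void remove() {
                    throw new UnsupportedOperationException();
                }
            };
        }
    };
}

// usage, e.g. with a List<String> names and a List<Integer> ages:
for (Pair<String, Integer> p : zip(names, ages)) {
    System.out.println(p.first + " is " + p.second);
}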
I had a task where I needed to collect various data from XML, store it in Set objects, and then output it to a CSV file. I read the data and stored it in three Sets as x, y, z. For the CSV file header, I used a StringBuffer to hold the column names:
StringBuffer buffer = new StringBuffer("");
buffer.append("FIRST_NAME,LAST_NAME,ADDRESS\r\n")
Set<String> x = new HashSet<String>();
Set<String> y = new HashSet<String>();
Set<String> z = new HashSet<String>();
....
Iterator iterator1 = x.iterator()
Iterator iterator2 = y.iterator()
Iterator iterator3 = z.iterator()
while(iterator1.hasNext() && iterator2.hasNext() && iterator3.hasNext()){
String fN = iterator1.next()
String lN = iterator2.next()
String aDS = iterator3.next()
buffer.append(""+fN+","+lN+","+aDS+"\r\n")
}
I know there are already topics on this exact thing, but none of them actually answer my question: is there a way to do this?
If I have a TreeMap that uses Strings as keys and TreeSet objects as values, is there a way I can add an int to the set that is associated with a specific key?
What I'm supposed to do is build a concordance from a text file using the TreeMap and TreeSet classes. My plan is this: use the words in the text file as the TreeMap keys, and use sets of the line numbers on which each word appears as the values. You step through the text file, and every time you get a word, you check the TreeMap to see if you already have that key. If you don't, you add it and create a new TreeSet of line numbers starting with the one you are on; if you already have it, you just add the line number to the set. So what I need is a way to call the set's add() method.
something like
map.get(identifier).add(lineNumber);
I know that doesn't work, but how do I do it?
I mean, if there is an easier way to do what I'm trying to do, I'd be happy to do that instead, but I would still like to know how to do it this way, just for learning and experience and all that.
Consider the following logic (I assume the input words are in an array):
TreeMap<String, TreeSet<Integer>> index = new TreeMap<String, TreeSet<Integer>>();
for (int pos = 0; pos < input.length; pos++) {
    String word = input[pos];
    TreeSet<Integer> wordPositions = index.get(word);
    if (wordPositions == null) {
        wordPositions = new TreeSet<Integer>();
        index.put(word, wordPositions);
    }
    wordPositions.add(pos);
}
This results in the index you need, which maps strings to the set of positions where each string appears. Depending on your specific needs, the outer/inner data structures can be changed to HashMap/HashSet respectively.
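Printing the resulting concordance is then a simple loop over the map entries:

for (Map.Entry<String, TreeSet<Integer>> e : index.entrySet()) {
    System.out.println(e.getKey() + " appears at positions: " + e.getValue());
}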
Why not use a Map from String to List<Integer> (generics don't allow primitive type arguments like int), something like:
Map<String, List<Integer>> map = new HashMap<String, List<Integer>>();
Then, whenever you get a word, you check whether it already exists in the Map; if it does, you add the line number to the List, and if not, you create a new entry in the Map for the given word and line number:
List<Integer> list = map.get(word); // look the word up only once
if (list != null) {
    list.add(line);
} else {
    list = new ArrayList<Integer>();
    list.add(line);
    map.put(word, list);
}
If I understand correctly, you want a TreeMap with each key referring to a TreeSet that stores the line numbers on which the key appears. That is definitely doable and the implementation is quite simple. I am not sure why your map.get(identifier).add(lineNumber) is not working. This is how I would do it:
TreeMap<String, TreeSet<Integer>> map = new TreeMap<String, TreeSet<Integer>>();
TreeSet<Integer> set = new TreeSet<Integer>();
set.add(1234);
map.put("hello", set);
map.get("hello").add(123);
It all works fine.
The only reason your construct won't work is that the result of map.get(identifier) can be null. Personally, I like the lazy-initialization solution that @EyalSchneider answered with. But there is an alternative if you know all your identifiers ahead of time: for example, if you preload your Map with all known English words. Then you can do something like:
for (String word : allEnglishWords) {
    map.put(word, new LinkedList<Integer>()); // note the parentheses: creates a new list per word
}
for (int pos = 0; pos < input.length; pos++) {
    String word = input[pos];
    map.get(word).add(pos);
}