First of all, I want to make clear that, as far as I know, the nature of this question is different from other questions that have already been posted. Please let me know if that is not the case.
Given
I have a list of ~3000 names.
There are ~2500 files which consist of names, one per line (taken from the name list).
Each file contains up to ~3000 names (and hence up to ~3000 lines, though the average is ~400).
Problem
At a given time I will be provided with 2 files. I have to create a list of names which are common to both files.
Pre Processing
To reduce the time complexity, I have done preprocessing and sorted the names in all files.
My Approach
Sorted the names in the given list and indexed them from 0 to 2999.
For each name in each file:
Calculated the group number (name_index / 30)
Calculated the group value (for each name in the same group, compute 2^(name_index % 30) and add the results)
Created a new file with the same name, in the format "groupNumber blankSpace groupValue"
Result
Instead of having ~3000 names (average ~400) in each file, I will now have at most 100 lines per file. To find common names I only have to check for common group numbers, and then, with the help of bit manipulation, I can work out which names are common.
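For illustration, a minimal Java sketch of this grouping and the bitwise intersection might look like the following (the class and method names are illustrative, and it assumes the names have already been mapped to their indices 0..2999):
import java.util.HashMap;
import java.util.Map;

public class GroupIntersect {
    // Build a map of groupNumber -> groupValue (a 30-bit mask) from a list of name indices.
    static Map<Integer, Integer> buildGroups(int[] nameIndices) {
        Map<Integer, Integer> groups = new HashMap<>();
        for (int idx : nameIndices) {
            int groupNumber = idx / 30;
            int bit = 1 << (idx % 30);
            groups.merge(groupNumber, bit, (a, b) -> a | b);
        }
        return groups;
    }

    // Intersect two group maps: common names are the bits set in both masks.
    static Map<Integer, Integer> intersect(Map<Integer, Integer> g1, Map<Integer, Integer> g2) {
        Map<Integer, Integer> common = new HashMap<>();
        for (Map.Entry<Integer, Integer> e : g1.entrySet()) {
            Integer other = g2.get(e.getKey());
            if (other != null) {
                int mask = e.getValue() & other;
                if (mask != 0) {
                    common.put(e.getKey(), mask);
                }
            }
        }
        // name index = groupNumber * 30 + bit position of each set bit
        return common;
    }
}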
Expectation
Can anyone please suggest a shorter and better solution to this problem? I can do preprocessing and store new files in my application so that minimal processing is required at the time of finding common names.
Please let me know if I am going in the wrong direction to solve the problem. Thanks in advance.
Points
In my approach the total size of the files is 258 KB (since I have used group numbers and group values), whereas if they are kept with one name per line the size is 573 KB. These files have to be stored on a mobile device, so I need to decrease the size as far as possible. I am also interested in data compression, but I have no idea how to do that, so please explain that as well.
Have you tried the following?
Read names one at a time from list1, adding them to a HashSet.
Read names from list2 one at a time, looking them up in the HashSet created from list one. If a name is in the HashSet, it is common to both files.
If you want to preprocess for some extra speed, store the # of names in each list and select the shorter list as list1.
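A minimal sketch of that approach, assuming the names are plain lines in two text files (the file names here are placeholders), could look like this:
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class CommonNames {
    public static void main(String[] args) throws IOException {
        // Placeholder paths; if you preprocess, pick the shorter file as list1.
        List<String> list1 = Files.readAllLines(Paths.get("list1.txt"));
        List<String> list2 = Files.readAllLines(Paths.get("list2.txt"));

        Set<String> seen = new HashSet<String>(list1);   // names from the first file
        for (String name : list2) {
            if (seen.contains(name)) {                   // present in both files
                System.out.println(name);
            }
        }
    }
}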
Aha! Given the very low memory requirement you stated in edit, there's another thing you could do.
Although I still think you could go for the solution other answers suggest. A HashSet with 3000 String entries won't get too big. My quick approximation with 16-char Strings suggests something below 400 kB of heap memory. Try it, then go back. It's like 25 lines of code for the whole program.
If the solution eats too much memory, then you could do this:
Sort the names in the files. That's always a good thing to have.
Open both files.
Read a line from both files.
If line1 < line2, read a new line1 from file 1 and repeat.
If line1 > line2, read a new line2 from file 2 and repeat.
Otherwise they are the same: add the line to the results, read a new line from both files, and repeat.
It eats virtually no memory and it's a good place to use a compareTo() method (if you used it to sort the names, that is) and a switch statement, I think.
The size of the files doesn't influence the memory usage at all.
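A hedged sketch of this sorted merge in Java, assuming both files are sorted with the natural String ordering, might look like this:
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class SortedMergeIntersect {
    static List<String> commonLines(String path1, String path2) throws IOException {
        List<String> common = new ArrayList<>();
        try (BufferedReader r1 = new BufferedReader(new FileReader(path1));
             BufferedReader r2 = new BufferedReader(new FileReader(path2))) {
            String line1 = r1.readLine();
            String line2 = r2.readLine();
            while (line1 != null && line2 != null) {
                int cmp = line1.compareTo(line2);
                if (cmp < 0) {
                    line1 = r1.readLine();      // advance the smaller side
                } else if (cmp > 0) {
                    line2 = r2.readLine();
                } else {
                    common.add(line1);          // same name in both files
                    line1 = r1.readLine();
                    line2 = r2.readLine();
                }
            }
        }
        return common;
    }
}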
About the data compression - there are lots of tools and algorithms you could use; try this (look at the related questions, too), or this.
You are attempting to re-implement a Set with a List. Don't do that. Use a Set of names, which will automatically take care of duplications of inserts.
You need to read both files, there is no way around doing that.
// Java (originally pseudo-code); needs: import java.nio.file.*; import java.util.*;
Set<String> names1 = new HashSet<String>();
for (String name : Files.readAllLines(Paths.get("file1.txt"))) {
    names1.add(name.trim());
}
Set<String> names2 = new HashSet<String>();
for (String name : Files.readAllLines(Paths.get("file2.txt"))) {
    names2.add(name.trim());
}
// with this line, names1 will discard any name not in names2
names1.retainAll(names2);
System.out.println(names1);
Assuming you use HashSet as this example does, you will be comparing hashes of the Strings, which will improve performance dramatically.
If you find the performance is not sufficient, then start looking for faster solutions. Anything else is premature optimization, and if you don't know how fast it must run, then it is optimization without setting a goal. Finding the "fastest" solution requires enumerating and exhausting every possible solution, as that solution you haven't checked yet might be faster.
I'm not sure whether I understood your requirements and situation.
You have about 2,500 files, each of ~3,000 words (or 400?). There are many duplicate words, which occur in multiple files.
Now somebody will ask you, which words have file-345 and file-765 in common.
You could create a HashMap where you store every word as a key, together with a list of the files in which that word occurs.
If you get file 345 with its 3000 words (400?), you look each word up in the HashMap and see whether file 765 is mentioned in that word's list.
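A minimal sketch of such a word-to-files index (the file IDs, class name, and the way words are fed in are assumptions for illustration):
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class WordFileIndex {
    // word -> list of file ids in which the word occurs
    private final Map<String, List<Integer>> index = new HashMap<>();

    void addWord(String word, int fileId) {
        index.computeIfAbsent(word, k -> new ArrayList<>()).add(fileId);
    }

    // Words of file A that also occur in file B, answered from the index alone.
    List<String> commonWords(List<String> wordsOfFileA, int fileB) {
        List<String> common = new ArrayList<>();
        for (String word : wordsOfFileA) {
            List<Integer> files = index.get(word);
            if (files != null && files.contains(fileB)) {
                common.add(word);
            }
        }
        return common;
    }
}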
However 2 * 3000 isn't that much. If I create 2 lists of Strings in Scala (which runs on the JVM):
val r = new scala.util.Random
val g1 = (1 to 3000).map (x => "" + r.nextInt (10000))
val g2 = (1 to 3000).map (x => "" + r.nextInt (10000))
and build the intersection
g1.intersect (g2)
I get the result (678 elements) in nearly no time on an 8-year-old laptop.
So how many requests will you have to answer? How often does the input of the files change? If rarely, then reading the 2 files might be the critical point.
How many unique words do you have? Maybe it is no problem at all to keep them all in memory.
Related
I am working on a project where I am processing a lot of tweets; the goal is to remove duplicates as I process them. I have the tweet IDs, which come in as strings of the format "166471306949304320"
I have been using a HashSet<String> for this, which works fine for a while. But by the time I get to around 10 million items I am drastically bogged down and eventually get a GC error, presumably from the rehashing. I tried defining a better size/load with
tweetids = new HashSet<String>(220000,0.80F);
and that lets it get a little farther, but is still excruciatingly slow (by around 10 million it is taking 3x as long to process). How can I optimize this? Given that I have an approximate idea of how many items should be in the set by the end (in this case, around 20-22 million) should I create a HashSet that rehashes only two or three times, or would the overhead for such a set incur too many time-penalties? Would things work better if I wasn't using a String, or if I define a different HashCode function (which, in this case of a particular instance of a String, I'm not sure how to do)? This portion of the implementation code is below.
tweetids = new HashSet<String>(220000,0.80F); // in constructor
duplicates = 0;
...
// In loop: For(each tweet)
String twid = (String) tweet_twitter_data.get("id");
// Check that we have not processed this tweet already
if (!(tweetids.add(twid))) {
    duplicates++;
    continue;
}
SOLUTION
Thanks to your recommendations, I solved it. The problem was the amount of memory required for the hash representations; first, HashSet<String> was simply enormous and uncalled for because the String.hashCode() is exorbitant for this scale. Next I tried a Trie, but it crashed at just over 1 million entries; reallocating the arrays was problematic. I used a HashSet<Long> to better effect and almost made it, but speed decayed and it finally crashed on the last leg of the processing (around 19 million). The solution came with departing from the standard library and using Trove. It finished 22 million records a few minutes faster than not checking duplicates at all. Final implementation was simple, and looked like this:
import gnu.trove.set.hash.TLongHashSet;
...
TLongHashSet tweetids; // class variable
...
tweetids = new TLongHashSet(23000000,0.80F); // in constructor
...
// inside for(each record)
String twid = (String) tweet_twitter_data.get("id");
if (!(tweetids.add(Long.parseLong(twid)))) {
    duplicates++;
    continue;
}
You may want to look beyond the Java collections framework. I've done some memory-intensive processing, and you will face several problems:
The number of buckets for large hash maps and hash sets is going to cause a lot of overhead (memory). You can influence this by using some kind of custom hash function and a modulo of e.g. 50000.
Strings are represented using 16-bit characters in Java. You can halve that by using UTF-8-encoded byte arrays for most scripts.
HashMaps are in general quite wasteful data structures and HashSets are basically just a thin wrapper around those.
Given that, take a look at Trove or Guava for alternatives. Also, your ids look like longs. Those are 64 bits, quite a bit smaller than the string representation.
An alternative you might want to consider is using Bloom filters (Guava has a decent implementation). A Bloom filter would tell you if something is definitely not in a set, and with reasonable certainty (less than 100%) whether something is contained. That, combined with some disk-based solution (e.g. a database, mapdb, memcached, ...), should work reasonably well. You could buffer up incoming new ids, write them in batches, and use the Bloom filter to check whether you need to look in the database, and thus avoid expensive lookups most of the time.
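For illustration, here is a minimal sketch of such a Bloom-filter front end using Guava's BloomFilter; the sizing figures and the disk-store lookup hook are assumptions, not part of the original question:
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

public class DedupWithBloom {
    // Sized for ~22 million ids with a ~1% false positive rate (assumed figures).
    private final BloomFilter<Long> maybeSeen =
            BloomFilter.create(Funnels.longFunnel(), 22_000_000, 0.01);

    boolean isDuplicate(long tweetId) {
        if (!maybeSeen.mightContain(tweetId)) {
            maybeSeen.put(tweetId);        // definitely new
            return false;
        }
        // Possible false positive: fall back to the authoritative store (placeholder).
        boolean inStore = lookUpInDiskStore(tweetId);
        if (!inStore) {
            maybeSeen.put(tweetId);
        }
        return inStore;
    }

    private boolean lookUpInDiskStore(long tweetId) {
        return false;   // hypothetical hook for a database / mapdb / memcached lookup
    }
}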
If you are just looking for the existence of Strings, then I would suggest you try using a Trie (also called a prefix tree). The total space used by a Trie should be less than a HashSet, and it's quicker for string lookups.
The main disadvantage is that it can be slower when used from a hard disk, as it is loading a tree rather than a linearly stored structure like a hash table. So make sure that it can be held inside RAM.
The link I gave is a good list of pros/cons of this approach.
*as an aside, the bloom filters suggested by Jilles Van Gurp are great fast prefilters.
Simple, untried and possibly stupid suggestion: Create a Map of Sets, indexed by the first/last N characters of the tweet ID:
Map<String, Set<String>> sets = new HashMap<String, Set<String>>();
String tweetId = "166471306949304320";
if (!sets.containsKey(tweetId.substring(0, 5))) {
    sets.put(tweetId.substring(0, 5), new HashSet<String>());
}
sets.get(tweetId.substring(0, 5)).add(tweetId);
assert sets.containsKey(tweetId.substring(0, 5)) && sets.get(tweetId.substring(0, 5)).contains(tweetId);
That easily lets you keep the maximum size of the hashing space(s) below a reasonable value.
I was reading about algorithmic problem and one was the following:
Given a file with millions of lines of data, there are 2 lines that are identical. The lines are so long that they may not fit in memory. Find the 2 identical lines.
The solution suggested was to read lines in parts and create hashes for each line.
E.g. you build the hash for line 1 by hashing part 1 of line 1 (which can be read into memory), then part 2 of line 1, and so on up to part N of line 1.
Store the hashes in a file or hashtable. For any matching hash values, compare the lines. If the lines are the same, we have solved it.
Although I understand this solution at a high level, I have no idea how it could be implemented. How can we associate a hash with a specific line in the file? Is this a language implementation detail?
E.g. in Java how would we address this?
The real answer is to buy more memory. The longest string you can have in Java is 2 GB, and that will fit in machines these days. You can buy 32 GB for less than $200.
But to solve the problem, I suggest you
find the offset of each line.
find the lines which are the same length (using the difference of offset)
calculate 64-bit or longer hashes of the lines with the same length.
for the lines with the same hash, do a byte-by-byte comparison.
Note: if you don't have enough memory to cache the entire file this will take a very long time. If you have a 32 GB machine and it has a 64 GB file, each pass will take about 20 minutes, and this has multiple passes.
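As an illustration of steps 1 and 3, here is a hedged one-pass sketch that records each line's byte offset, length, and a simple 64-bit rolling hash without ever holding a whole line in memory; the hash function and newline handling are simplifications:
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class LineFingerprints {
    static class Fingerprint {
        final long offset;   // byte offset where the line starts
        final long length;   // line length in bytes (without the newline)
        final long hash;     // 64-bit rolling hash of the line's bytes
        Fingerprint(long offset, long length, long hash) {
            this.offset = offset; this.length = length; this.hash = hash;
        }
    }

    // One pass over the file: no line is ever held in memory as a whole.
    static List<Fingerprint> scan(String path) throws IOException {
        List<Fingerprint> result = new ArrayList<>();
        try (BufferedInputStream in = new BufferedInputStream(new FileInputStream(path))) {
            long pos = 0, lineStart = 0, length = 0;
            long hash = 1125899906842597L;   // arbitrary seed
            int b;
            while ((b = in.read()) != -1) {
                pos++;
                if (b == '\n') {
                    result.add(new Fingerprint(lineStart, length, hash));
                    lineStart = pos;
                    length = 0;
                    hash = 1125899906842597L;
                } else {
                    hash = 31 * hash + b;    // simple polynomial hash, 64-bit
                    length++;
                }
            }
            if (length > 0) {
                result.add(new Fingerprint(lineStart, length, hash));
            }
        }
        return result;
    }
    // Group fingerprints by (length, hash); only groups with more than one entry
    // need a byte-by-byte comparison, e.g. via RandomAccessFile.seek(offset).
}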
1) Which API can I use to find the offset?
You count the number of bytes you have read, and that is the offset.
2) "The real answer is to buy more memory": project managers don't agree with this for real products. Do you have a different experience?
I point out to them that I could spend a day, which could cost them more than $1000 (even if that is not what I get paid), to save $100 of reusable memory, if they think that is a good use of resources. I let them decide. ;)
My 8-year-old son has 8 GB in a PC he built, as the memory cost me £24. Yet you are right that there are project managers who think 8 GB is too much for a professional who is costing them that much per hour!? I have 16 GB in a PC which I don't use to run anything serious, because I do my work on a machine with 256 GB. You can buy machines with 2 TB these days, which is overkill for most applications. ;)
While I agree the solution is to utilize modern techniques and leverage how cheap memory is these days, the problem is meant to exercise the mind and show how to solve it under the given constraints.
The hashing you talked about is rather simple.
The Java solution can leverage a few things under the hood which may obscure what's actually going on, so I will explain the solution first and the Java implementation second.
Generic Solution:
Hash functions such as SHA-1, MD5, etc. generate a fixed-size value by compressing the input. Let's say you can only store the first MB of characters of each line.
You would iterate over each line, get the first MB of characters, and pass that into the hashing algorithm (MD5, for example).
You then map the hash as the key, and a list/array of line numbers as the value.
After the first pass, any lines with a matching first MB of characters will end up with the same hash, and thus in the same list in the map.
To prepare for the second pass, you search the map and cull any lists that contain only one line number.
Then you create a list of line numbers by compiling the line numbers from the remaining entries in the map; these lines will be the only ones checked in the second pass.
On the second pass, you pull the second MB of characters from each line in your line list, hash them, and put them in the map in the same fashion as in pass one.
Iterate over the entries in the map, culling hash entries that only have one line number.
Repeat step two, incrementing the character block (MB) to coincide with the pass number.
When you reach a pass where you only have one hash with multiple line numbers, and that hash only has two elements, those lines are the two that are the same.
This is essentially a tree search.
Java Method:
Java has a class called HashMap, which automatically hashes the key. By using a
HashMap<String,ArrayList<Integer>>
for your master map, all you have to do is call
map.get(mbBlock).add(lineNumber); of course, you should check whether this is the first time the key has been used, so you don't get a NullPointerException.
After each pass, cull the entries containing only one line number.
Iterate again over the remaining lines until you only have two line numbers left.
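Here is a minimal sketch of one such pass in Java; the readBlock helper, which would pull the pass-th MB of a given line (e.g. via its recorded file offset), is hypothetical and left unimplemented:
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PassBasedMatcher {
    // One pass: take the pass-th block of each candidate line, group line numbers by block
    // (HashMap hashes the key for us), and drop any group that contains only a single line.
    static Map<String, List<Integer>> runPass(List<Integer> candidates, int pass) {
        Map<String, List<Integer>> byBlock = new HashMap<>();
        for (int lineNumber : candidates) {
            String block = readBlock(lineNumber, pass);                // hypothetical helper
            byBlock.computeIfAbsent(block, k -> new ArrayList<>()).add(lineNumber);
        }
        byBlock.values().removeIf(lines -> lines.size() < 2);          // cull singletons
        return byBlock;
    }

    private static String readBlock(int lineNumber, int pass) {
        // Hypothetical: return the pass-th MB of characters of the given line,
        // e.g. via RandomAccessFile and the line's recorded offset.
        throw new UnsupportedOperationException("sketch only");
    }
}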
Get the first k characters of each line, where k is configurable. Do your hash to find several groups of lines that could have identical lines.
Based on the result of the first step, in which you greatly narrow down the search range, run your algorithm on each smaller group for the next k characters.
The search range is narrowed down dramatically after each round, except in the worst case.
The trick of the algorithm is to break a big problem into smaller ones and make full use of the results of previous steps.
I have a big 2 GB text file with 5 columns delimited by tabs.
A row is called a duplicate only if 4 out of the 5 columns match.
Right now, I am de-duping by first loading each column into a separate List,
then iterating through the lists, deleting duplicate rows as they are encountered, and aggregating.
The problem: it is taking more than 20 hours to process one file.
I have 25 such files to process.
Can anyone please share their experience of how they would go about such de-duping?
This de-duping code will be throwaway code, so I was looking for a quick and dirty solution to get the job done as soon as possible.
Here is my pseudo code (roughly)
// iterate over the rows
for i = 0 to last_row
    // iterate over rows i+1 to last_row
    for j = i+1 to last_row
        if (col1 matches    // find duplicate
            && col2 matches
            && col3 matches
            && col4 matches)
        {
            col5List.set(i, get col5);   // aggregate
        }
Duplicate example
A and B are duplicates: A=(1,1,1,1,1), B=(1,1,1,1,2), C=(2,1,1,1,1), and the output would be A=(1,1,1,1,1+2), C=(2,1,1,1,1) [notice that B has been kicked out].
A HashMap will be your best bet. In a single, constant time operation, you can both check for duplication and fetch the appropriate aggregation structure (a Set in my code). This means that you can traverse the entire file in O(n). Here's some example code:
public void aggregate() throws Exception
{
    BufferedReader bigFile = new BufferedReader(new FileReader("path/to/file.csv"));
    // Notice the parameter for initial capacity. Use something that is large enough to prevent rehashing.
    Map<String, HashSet<String>> map = new HashMap<String, HashSet<String>>(500000);
    while (bigFile.ready())
    {
        String line = bigFile.readLine();
        int lastTab = line.lastIndexOf('\t');
        String firstFourColumns = line.substring(0, lastTab);
        // See if the map already contains an entry for the first 4 columns
        HashSet<String> set = map.get(firstFourColumns);
        // If set is null, then the map hasn't seen these columns before
        if (set == null)
        {
            // Make a new Set (for aggregation), and add it to the map
            set = new HashSet<String>();
            map.put(firstFourColumns, set);
        }
        // At this point we either found set or created it ourselves
        String lastColumn = line.substring(lastTab + 1);
        set.add(lastColumn);
    }
    bigFile.close();
    // A demo that shows how to iterate over the map and set structures
    for (Map.Entry<String, HashSet<String>> entry : map.entrySet())
    {
        String firstFourColumns = entry.getKey();
        System.out.print(firstFourColumns + "=");
        HashSet<String> aggregatedLastColumns = entry.getValue();
        for (String column : aggregatedLastColumns)
        {
            System.out.print(column + ",");
        }
        System.out.println("");
    }
}
A few points:
The initialCapacity parameter for the HashMap is important. If the number of entries gets bigger than the capacity, then the structure is re-hashed, which is very slow. The default initial capacity is 16, which will cause many rehashes for you. Pick a value that you know is greater than the number of unique sets of the first four columns.
If ordered output in the aggregation is important, you can switch the HashSet for a TreeSet.
This implementation will use a lot of memory. If your text file is 2GB, then you'll probably need a lot of RAM in the jvm. You can add the jvm arg -Xmx4096m to increase the maximum heap size to 4GB. If you don't have at least 4GB this probably won't work for you.
This is also a parallelizable problem, so if you're desperate you could thread it. That would be a lot of effort for throw-away code, though. [Edit: This point is likely not true, as pointed out in the comments]
I would sort the whole list on the first four columns, and then traverse through the list knowing that all the duplicates are together. This would give you O(NlogN) for the sort and O(N) for the traverse, rather than O(N^2) for your nested loops.
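A rough sketch of that idea, assuming the tab-separated rows are already loaded into a list and the sort key is everything before the last tab (class and method names are illustrative):
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class SortAndAggregate {
    // Sort rows by their first four columns, then aggregate column 5 of adjacent duplicates.
    static List<String> dedupe(List<String> rows) {
        rows.sort(Comparator.comparing(SortAndAggregate::firstFourColumns));
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < rows.size()) {
            String key = firstFourColumns(rows.get(i));
            StringBuilder agg = new StringBuilder(lastColumn(rows.get(i)));
            int j = i + 1;
            while (j < rows.size() && firstFourColumns(rows.get(j)).equals(key)) {
                agg.append('+').append(lastColumn(rows.get(j)));   // aggregate column 5
                j++;
            }
            out.add(key + '\t' + agg);
            i = j;
        }
        return out;
    }

    private static String firstFourColumns(String row) {
        return row.substring(0, row.lastIndexOf('\t'));
    }

    private static String lastColumn(String row) {
        return row.substring(row.lastIndexOf('\t') + 1);
    }
}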
I would use a HashSet of the records. This can lead to an O(n) timing instead of O(n^2). You can create a class which has each of the fields with one instance per row.
You need to have a decent amount of memory, but 16 to 32 GB is pretty cheap these days.
I would do something similar to Eric's solution, but instead of storing the actual strings in the HashMap, I'd just store line numbers. So for a particular four-column hash, you'd store a list of line numbers which hash to that value. And then on a second pass through the data, you can remove the duplicates at those line numbers/add the +x as needed.
This way, your memory requirements will be a LOT smaller.
The solutions already posted are nice if you have enough (free) RAM. As Java tends to "still work" even if it is heavily swapping, make sure you don't have too much swap activity if you presume RAM could have been the limiting factor.
An easy "throwaway" solution in case you really have too little RAM is partitioning the file into multiple files first, depending on data in the first four columns (for example, if the third column values are more or less uniformly distributed, partition by the last two digits of that column). Just go over the file once, and write the records as you read them into 100 different files, depending on the partition value. This will need minimal amount of RAM, and then you can process the remaining files (that are only about 20MB each, if the partitioning values were well distributed) with a lot less required memory, and concatenate the results again.
Just to be clear: If you have enough RAM (don't forget that the OS wants to have some for disk cache and background activity too), this solution will be slower (maybe even by a factor of 2, since twice the amount of data needs to be read and written), but in case you are swapping to death, it might be a lot faster :-)
Let's say I have 500 words:
Martin
Hopa
Dunam
Golap
Hugnog
Foo
... + 494 more words
I have the following text, which is about 85 KB in total:
Marting went and got him self stuff
from Hopa store and now he is looking
to put it into storage with his best
friend Dunam. They are planing on
using Golap lock that they found in
Hugnog shop in Foo town. >... text continues into several pages
I would like to produce the following text:
------- went and got him self stuff
from ---- store and now he is looking
to put it into storage with his best
friend ----. They are planing on
using ---- lock that they found in
------ shop in --- town. >... text continues into several pages
Currently I'm using the Commons StringUtils method:
String[] words500 = // all 500 words
String[] maskForWords500 = // generated mask for each word
String filteredText = StringUtils.replaceEach(textToBeFiltered, words500, maskForWords500);
Is there a another way to do this that could be more efficient when it comes to memory and CPU usage?
What is the best storage for the 500 words? File, List, enum, array ...?
How would you get statistics, such as how many and what words were replaced; and for each word how many times it was replaced.
I wouldn't care much about CPU and memory usage. It should be relatively small for such a problem and such a volume of text.
What I would do is
have a Map containing all the strings as keys, with the number of times they have been found in the text (initially 0)
read the text word by word, by using a StringTokenizer, or the String.split() method
for each word, find if the map contains it (O(1) operation, very quick)
if it contains it, add "----" to a StringBuilder, and increment the value stored for the word in the map
else add the word itself (with a space before unless it's the first word of the text)
At the end of the process, the StringBuilder contains the result, and the map contains the number of times each word has been used as a replacement.
Make sure to initialize the StringBuilder with the length of the original text, in order to avoid too many reallocations.
Should be simple and efficient.
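A minimal sketch of that approach; the sample words and text and the whitespace-only tokenization are simplifications (real text would need punctuation handling):
import java.util.HashMap;
import java.util.Map;

public class WordMasker {
    public static void main(String[] args) {
        String[] words = {"Martin", "Hopa", "Dunam", "Golap", "Hugnog", "Foo"};
        Map<String, Integer> replaceCounts = new HashMap<>();
        for (String w : words) {
            replaceCounts.put(w, 0);
        }

        String text = "Martin went and got him self stuff from Hopa store";
        StringBuilder result = new StringBuilder(text.length());
        for (String token : text.split("\\s+")) {                 // simplistic tokenization
            if (result.length() > 0) {
                result.append(' ');
            }
            if (replaceCounts.containsKey(token)) {
                result.append(repeat('-', token.length()));       // mask the word
                replaceCounts.merge(token, 1, Integer::sum);      // count the replacement
            } else {
                result.append(token);
            }
        }
        System.out.println(result);
        System.out.println(replaceCounts);                        // per-word statistics
    }

    private static String repeat(char c, int n) {
        StringBuilder sb = new StringBuilder(n);
        for (int i = 0; i < n; i++) {
            sb.append(c);
        }
        return sb.toString();
    }
}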
I wouldn't care about memory much, but in case you do: trie is your friend. It's memory efficient for large sets and it allows very efficient matching. You may want to implement it in a compressed fashion.
If I understand the problem correctly, you need to read the 85KB of text and parse out every word (use split or StringTokenizer). For every word, you need to know if you have it in the set of 500words, and if so, switch it with the corresponding mask.
If you know you have about 500 words, I'd suggest storing the 500 words and their masks in a HashMap with an initial capacity of about 650 (the JDK docs say hashing is most efficient with a load factor of 0.75). Put the word-mask pairs into the HashMap with a for loop.
The biggest bang for the buck with a HashMap is that the get/put operations (searching for the key) are done in constant time, which is better than O(n) for an array and even O(log n) for a binary search on a sorted array.
Armed with the HashMap, you can build up a StringBuffer while filtering those 85 KB of text.
Return the StringBuffer's toString() from your method and you are done! Regards, - M.S.
PS If you are building the map at a server and doing the filtering somewhere else (at a client) and need to transport the Dictionary, HashMap won't do - it cannot be serialized. Use a Hashtable in that case. If on the same machine, HashMap is more memory efficient. Later, - M.S.
I'm programming a java application that reads strictly text files (.txt). These files can contain upwards of 120,000 words.
The application needs to store all +120,000 words. It needs to name them word_1, word_2, etc. And it also needs to access these words to perform various methods on them.
The methods all have to do with Strings. For instance, a method will be called to say how many letters are in word_80. Another method will be called to say what specific letters are in word_2200.
In addition, some methods will compare two words. For instance, a method will be called to compare word_80 with word_2200 and needs to return which has more letters. Another method will be called to compare word_80 with word_2200 and needs to return what specific letters both words share.
My question is: Since I'm working almost exclusively with Strings, is it best to store these words in one large ArrayList? Several small ArrayLists? Or should I be using one of the many other storage possibilities, like Vectors, HashSets, LinkedLists?
My two primary concerns are 1.) access speed, and 2.) having the greatest possible number of pre-built methods at my disposal.
Thank you for your help in advance!!
Wow! Thanks everybody for providing such a quick response to my question. All your suggestions have helped me immensely. I’m thinking through and considering all the options provided in your feedback.
Please forgive me for any fuzziness; and let me address your questions:
Q) English?
A) The text files are actually books written in English. The occurrence of a word in a second language would be rare – but not impossible. I’d put the percentage of non-English words in the text files at .0001%
Q) Homework?
A) I’m smilingly looking at my question’s wording now. Yes, it does resemble a school assignment. But no, it’s not homework.
Q) Duplicates?
A) Yes. And probably every five or so words, considering conjunctions, articles, etc.
Q) Access?
A) Both random and sequential. It’s certainly possible a method will locate a word at random. It’s equally possible a method will want to look for a matching word between word_1 and word_120000 sequentially. Which leads to the last question…
Q) Iterate over the whole list?
A) Yes.
Also, I plan on growing this program to perform many other methods on the words. I apologize again for my fuzziness. (Details do make a world of difference, do they not?)
Cheers!
I would store them in one large ArrayList and worry about (possibly unnecessary) optimisations later on.
Being inherently lazy, I don't think it's a good idea to optimise unless there's a demonstrated need. Otherwise, you're just wasting effort that could be better spent elsewhere.
In fact, if you can set an upper bound to your word count and you don't need any of the fancy List operations, I'd opt for a normal (native) array of string objects with an integer holding the actual number. This is likely to be faster than a class-based approach.
This gives you the greatest speed in accessing the individual elements whilst still retaining the ability to do all that wonderful string manipulation.
Note I haven't benchmarked native arrays against ArrayLists. They may be just as fast as native arrays, so you should check this yourself if you have less blind faith in my abilities than I do :-).
If they do turn out to be just as fast (or even close), the added benefits (expandability, for one) may be enough to justify their use.
Just confirming pax's assumptions, with a very naive benchmark:
public static void main(String[] args)
{
    int size = 120000;
    String[] arr = new String[size];
    ArrayList<String> al = new ArrayList<String>(size);
    for (int i = 0; i < size; i++)
    {
        String put = Integer.toHexString(i);
        // System.out.print(put + " ");
        al.add(put);
        arr[i] = put;
    }
    Random rand = new Random();
    Date start = new Date();
    for (int i = 0; i < 10000000; i++)
    {
        int get = rand.nextInt(size);
        String fetch = arr[get];
    }
    Date end = new Date();
    long diff = end.getTime() - start.getTime();
    System.out.println("array access took " + diff + " ms");
    start = new Date();
    for (int i = 0; i < 10000000; i++)
    {
        int get = rand.nextInt(size);
        String fetch = al.get(get);
    }
    end = new Date();
    diff = end.getTime() - start.getTime();
    System.out.println("array list access took " + diff + " ms");
}
and the output:
array access took 578 ms
array list access took 907 ms
Running it a few times, the actual times seem to vary a bit, but generally array access is between 200 and 400 ms faster over 10,000,000 iterations.
If you will access these Strings sequentially, the LinkedList would be the best choice.
For random access, ArrayLists have a nice memory usage/access speed tradeoff.
My take:
For a non-threaded program, an ArrayList is always fastest and simplest.
For a threaded program, a java.util.concurrent.ConcurrentHashMap<Integer,String> or java.util.concurrent.ConcurrentSkipListMap<Integer,String> is awesome. Perhaps you would later like to allow threads so as to make multiple queries against this huge thing simultaneously.
If you're going for fast traversal as well as compact size, use a DAWG (Directed Acyclic Word Graph.) This data structure takes the idea of a trie and improves upon it by finding and factoring out common suffixes as well as common prefixes.
http://en.wikipedia.org/wiki/Directed_acyclic_word_graph
Use a Hashtable? This will give you your best lookup speed.
ArrayList/Vector if order matters (it appears to, since you are calling the words "word_xxx"), or HashTable/HashMap if it doesn't.
I'll leave the exercise of figuring out why you would want to use an ArrayList vs. a Vector or a HashTable vs. a HashMap up to you since I have a sneaking suspicion this is your homework. Check the Javadocs.
You're not going to get any methods that help you as you've asked for in the examples above from your Collections Framework class, since none of them do String comparison operations. Unless you just want to order them alphabetically or something, in which case you'd use one of the Tree implementations in the Collections framework.
How about a radix tree or Patricia trie?
http://en.wikipedia.org/wiki/Radix_tree
The only advantage of a linked list over an array or array list would be if there are insertions and deletions at arbitrary places. I don't think this is the case here: You read in the document and build the list in order.
I THINK that when the original poster talked about finding "word_2200", he meant simply the 2200th word in the document, and not that there are arbitrary labels associated with each word. If so, then all he needs is indexed access to all the words. Hence, an array or array list. If there really is something more complex, if one word might be labeled "word_2200" and the next word is labeled "foobar_42" or some such, then yes, he'd need a more complex structure.
Hey, do you want to give us a clue WHY you want to do any of this? I'm hard pressed to remember the last time I said to myself, "Hey, I wonder if the 1,237th word in this document I'm reading is longer or shorter than the 842nd word?"
Depends on what the problem is - speed or memory.
If it's memory, the minimum solution is to write a function getWord(n) which scans the whole file each time it runs, and extracts word n.
Now, that's not a very good solution. A better solution is to decide how much memory you want to use: let's say 1000 items. Scan the file for words once when the app starts, and store a series of bookmarks containing the word number and the position in the file where it is located; do this in such a way that the bookmarks are more or less evenly spaced through the file.
Then, open the file for random access. The function getWord(n) now looks at the bookmarks to find the biggest word # <= n (please use a binary search), does a seek to get to the indicated location, and scans the file, counting the words, to find the requested word.
An even quicker solution, using rather more memory, is to build some sort of cache for the blocks, on the basis that getWord() requests usually come through in clusters. You can rig things up so that if someone asks for word #X and it's not in the bookmarks, then you seek for it and put it in the bookmarks, saving memory by consolidating whichever bookmark was least recently used.
And so on. It depends, really, on what the problem is and on what kinds of retrieval patterns are likely.
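A hedged sketch of the bookmark idea; it assumes the bookmark list is built once at startup, is sorted by word index, starts at word 0, and that words are separated by single spaces (class and field names are illustrative):
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.List;

public class BookmarkedWordFile {
    // A bookmark records: "word number N starts at byte offset O".
    static class Bookmark {
        final int wordIndex;
        final long offset;
        Bookmark(int wordIndex, long offset) { this.wordIndex = wordIndex; this.offset = offset; }
    }

    private final List<Bookmark> bookmarks;   // built once at startup, sorted by wordIndex
    private final String path;

    BookmarkedWordFile(String path, List<Bookmark> bookmarks) {
        this.path = path;
        this.bookmarks = bookmarks;
    }

    String getWord(int n) throws IOException {
        // Binary search for the largest bookmark with wordIndex <= n.
        int lo = 0, hi = bookmarks.size() - 1, best = 0;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (bookmarks.get(mid).wordIndex <= n) { best = mid; lo = mid + 1; } else { hi = mid - 1; }
        }
        Bookmark bm = bookmarks.get(best);
        try (RandomAccessFile file = new RandomAccessFile(path, "r")) {
            file.seek(bm.offset);
            // Scan forward, counting words, until the requested one.
            int current = bm.wordIndex;
            StringBuilder word = new StringBuilder();
            int b;
            while ((b = file.read()) != -1) {
                if (b == ' ' || b == '\n') {
                    if (current == n) return word.toString();
                    current++;
                    word.setLength(0);
                } else {
                    word.append((char) b);
                }
            }
            return current == n ? word.toString() : null;   // last word, or not found
        }
    }
}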
I don't understand why so many people are suggesting ArrayList or the like, since you don't mention ever having to iterate over the whole list. Further, it seems you want to access the words as key/value pairs ("word_348"="pedantic").
For the fastest access, I would use a TreeMap, which will do binary searches to find your keys. Its only downside is that it's unsynchronized, but that's not a problem for your application.
http://java.sun.com/javase/6/docs/api/java/util/TreeMap.html