Huge Static Array of String - java

Is it a good idea to store the words of a dictionary with 100,000 words in a static array of String? I'm working on a spellchecker and I thought that approach would be faster.

You should generally prefer a Java Collections Framework class to a native Java array for anything non-trivial. In this particular case, what you have is a Set<String> (since no words should appear more than once in the dictionary).
A HashSet<String> offers constant-time performance for the basic operations add, remove, and contains, and should work very well with String's hashCode implementation.
For larger dictionaries, you'd want to use more sophisticated data structures specialized for storing a set of strings (e.g. a trie), but for 100K words, a HashSet should suffice.
See also
Java Tutorials/Collections Framework
Effective Java 2nd Edition, Item 25: Prefer lists to arrays
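
For illustration, a minimal sketch of the HashSet approach (the dictionary file name and the one-word-per-line format are assumptions):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Set;

public class SpellChecker {
    private final Set<String> dictionary = new HashSet<>();

    // Assumed format: a plain-text file with one word per line.
    public SpellChecker(String dictionaryPath) throws IOException {
        for (String word : Files.readAllLines(Paths.get(dictionaryPath))) {
            dictionary.add(word.trim().toLowerCase());
        }
    }

    // contains is a constant-time operation on a HashSet.
    public boolean isCorrect(String word) {
        return dictionary.contains(word.toLowerCase());
    }
}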

It's definitely not a good idea to store so many strings in an array, especially if you are using it for spell checking, which means you will have to search for and compare strings. Searching for or comparing a string against the array would be inefficient, since it would always be a linear search.

How about an approach with in-memory database technology, for example an in-memory SQLite database? This would allow you to use efficient querying without disk overhead.

I think 100,000 is not so large an amount that searching would be inefficient. Of course, it depends... Checking whether a word exists in the array is a linear-time algorithm. You can keep the array sorted and use binary search to make it more efficient.
On the other hand, if you would like to find the 5 most likely words (using an N-gram method or something similar), you should consider using Lucene or another text database.
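
A sketch of the sorted-array variant using the JDK's built-in binary search:

import java.util.Arrays;

public class SortedLookup {
    public static void main(String[] args) {
        String[] words = {"banana", "apple", "cherry"};
        Arrays.sort(words);  // sort once: O(n log n)
        // binarySearch returns a non-negative index when the word is present: O(log n)
        boolean found = Arrays.binarySearch(words, "cherry") >= 0;
        System.out.println(found);  // true
    }
}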

Perhaps using an SQLite database would be more efficient? I think that's what Firefox/Thunderbird does for spell checking, but I'm not entirely sure.

You won't be able to initialize that many strings in a static variable. Java limits the bytecode size of static initializers and even ordinary method bodies to 64KB. Simply use a flat file and read it upon class initialization - Java is faster than most people think at those things.
See Enum exeeding the 65535 bytes limit of static initializer... what's best to do?.
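
A sketch of that flat-file approach (the resource name words.txt is an assumption); the loop runs at class-initialization time, so the generated bytecode stays tiny no matter how many words the file contains:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.HashSet;
import java.util.Set;

public class Dictionary {
    private static final Set<String> WORDS = new HashSet<>();

    static {
        // Assumes words.txt (one word per line) is on the classpath.
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                Dictionary.class.getResourceAsStream("/words.txt"),
                StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                WORDS.add(line);
            }
        } catch (IOException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    public static boolean contains(String word) {
        return WORDS.contains(word);
    }
}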

Related

Collection type for String search

I have a String that I need to search for in a collection of Strings. I'll need to do searches for multiple representations of the required String (original representation, trimmed, UTF-8 encoded, non-ASCII characters encoded). The collection size will be on the order of thousands.
I'm trying to figure out what's the best representation to use for the collection in order to have the best performance:
ArrayList - iterate over the list and check whether any element matches any of the String's representations
HashMap - check whether the map contains any of the String's representations
Any other?
Generally speaking, a HashMap (or any other hashtable-based data structure) is much preferred for a "lookup" exercise. The reason is simple: those data structures support lookup in constant time (independent of collection size).
But... in your scenario (a single query against the collection), you probably will not gain any performance improvement from using a HashMap instead of an ArrayList. Reasons:
Putting the data inside a HashMap will take some time. Not significant time, but comparable to one full pass over the initial list.
Your collection is pretty small - iterating over 5,000 elements is a matter of a couple of milliseconds (or less). Since you need to "search" only once, you will not save much time on that.
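
To make the trade-off concrete, a sketch of the build-once-then-probe variant (the method and parameter names are illustrative):

import java.util.Collection;
import java.util.HashSet;
import java.util.Set;

public class MultiRepSearch {
    // Returns true if any representation of the target string occurs in the collection.
    static boolean containsAny(Collection<String> collection, Collection<String> representations) {
        Set<String> lookup = new HashSet<>(collection);  // one full pass to build
        for (String rep : representations) {
            if (lookup.contains(rep)) {                  // O(1) per probe afterwards
                return true;
            }
        }
        return false;
    }
}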

Looking for memory efficient design

I'm running some experiments over a large dataset and would like to optimize a particular part. Currently, I have 5-6 Models each of which stores a mapping from Topics to List of Strings. The set of Topics is large and the same between each Model, so there must be a better way. Ultimately the query I need to perform is: what is the String in position x of the List for some Model-Topic combination.
One of the problems with using the mapping method is that if there are say 500k-5M topics, each has a list of 20 strings. Then my Map<Model, Map<Topic, List<String>>> is going to be massive.
Have you tried SortedSet / Maps? It sounds like you need to optimize your search; sorted collections (like TreeMap) should be O(log n), while scanning a regular list is O(n). Of course, this kind of thing is something at which databases excel...
Not clear where/how you want to achieve "memory efficiency". First one needs to look at the particulars of your detailed data to see how much storage that consumes, then examine various ways of organizing it and analyze their efficiency in terms of % overhead vs your "real" data.
A brief glance shows that a HashMap, when you consider the associated tables, has about 80 bytes of overhead per entry. An ArrayList looks to average out around 10-12. Without looking, I would guess that a TreeMap would be more than a HashMap -- maybe 100.
Generally speaking, links within your own objects will be "cheaper", both in storage and speed to access, than links using these aggregating objects. But the aggregating objects are convenient to use, and have been "optimized" to a degree.
(But looking at your update, you probably should be looking at a DB application, rather than holding everything in heap.)
You could use Topic and Model to construct a composite key in a single Map, e.g.
map.put(topic1_id + model1_id, list1_1);
map.put(topic1_id + model2_id, list1_2);
...
map.get(topic_id + model_id)
where the IDs are Strings joined with a delimiter, so that distinct ID pairs cannot concatenate to the same key (a similar scheme could be used with numeric identifiers).
A similar approach is to assign each topic and model a unique number, then store the lists of strings in arrays, so looking up the list for a given combination is a matter of looking up two indexes, then accessing a given location in a 2D array. (however, this is easier when you know the number of topics and models in advance of constructing the data structure)
For memory efficiency, also consider the small details. In general, you want to minimise the number of Objects - each Object carries an overhead. ArrayLists can have a lot of wasted space as they grow dynamically, expanding by about 50% whenever they exceed their current capacity. If you can pre-size them to the required capacity (or use an array instead) then you can save a lot of memory. The same applies when using large numbers of small HashMaps.
One possible data structure is a hierarchy of maps, leading to an array of Strings. E.g.:
HashMap<Model, HashMap<Topic, String[]>> map;
A query function would then look like:
public String query(Model model, Topic topic, int x) {
    HashMap<Topic, String[]> childMap = map.get(model);
    if (childMap == null) {
        return null;
    }
    String[] list = childMap.get(topic);
    if (list == null) {
        return null;
    }
    return list[x];
}
Presuming your Model and Topic structures implement hashCode() and equals() reasonably, the query performance should be quite good.
One potential weakness: I'm assuming you need to index a large number of Model/Topic combinations, and related lists of Strings (if not, you presumably wouldn't be asking about optimization). My guess is that the child String[] arrays will consume a large amount of memory. Each array is a Java object (about 20 bytes) + a pointer at each array location.
Two suggestions there:
1) If many Model/Topic combinations share the same set of Strings, you could gain quite a lot by sharing those String[] instances.
2) If you're using a 64-bit VM, be sure to use compressed ordinary object pointers (-XX:+UseCompressedOops). That will at least keep most of the pointers to 4 bytes instead of 8. Compressed OOPs is the default since 1.6.0_23, so a relatively recent VM will save you some memory here.
One other possibility not mentioned is to store the strings in a String[][][], and the models and topics in a List such as an ArrayList, and then at query time:
public String query(Model model, Topic topic, int x) {
    return strings[models.indexOf(model)][topics.indexOf(topic)][x];
}
It could be further improved for speed if the topics and models were sorted, then binary search rather than indexOf could be used.

Compressed SortedSet<Long> implementation

I need to store a large number of Long values in a SortedSet implementation in a space-efficient manner. I was considering bit-set implementations and discovered Javaewah. However, the API expects int values rather than longs.
Can anyone recommend any alternatives or suggest a good way to solve this problem? I am mainly concerned with space efficiency. Upon building the set I will need to access the minimum and maximum elements once. However, access time is not a huge concern (so a fully run-length-encoded implementation would be fine).
EDIT
I should be clear that the implementation does not have to implement the SortedSet interface providing I can access the minimum and maximum elements of the collection.
You could use TLongArrayList (from the Trove library), which uses a long[] underneath. It supports sort(), so the min and max will be the first and last values.
Or you could use a long[] with a length and do this yourself. ;)
This will use about 64 bytes more than the raw values themselves. You can get more compact if you can make some assumptions about the range of long values, e.g. if they are actually limited to 48 bits.
You might consider using a LongBuffer. If it is memory-mapped it avoids using heap or direct memory, but you would have to implement a sort routine yourself.
If they are clustered, you might be able to represent the data as a Set of ranges. The ranges could be a pure A - B, or a BitSet with a starting value. The latter works well for phone numbers. ;)
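
A sketch of the do-it-yourself long[] variant; after an in-place sort, the min and max are simply the two ends of the array:

import java.util.Arrays;

public class LongStats {
    public static void main(String[] args) {
        long[] values = {42L, 7L, 1_000_000L, -3L};  // 8 bytes per value, no boxing
        Arrays.sort(values);                         // in-place sort
        long min = values[0];                        // smallest after sorting
        long max = values[values.length - 1];        // largest after sorting
        System.out.println(min + " .. " + max);      // -3 .. 1000000
    }
}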
Not sure if it has a Set or how efficient it is compared to the regular JCF, but take a look at this:
http://commons.apache.org/primitives/

Counting repeated words in a file

Goal: to find the count of all words in a file. The file contains 1000+ words.
My approach: use a HashMap<String,Integer>() to store and count the number of times each word appears in the file.
Question:
Would a HashMap be the best way, or would it be better to use a binary tree to ensure faster lookup, given the large count of words in the file?
Or is there a better way to do this?
A HashMap would result in a lot of memory overhead, which is not desired.
So you're looking for distinct words?
The most efficient structure I can think of is a Trie
Here's one open source implementation: Google Code patricia-trie
Although I tend to agree with Mitch Wheat -- It sounds like a HashMap should work fine (It's always best to avoid premature optimization... so you should use a HashMap until you've shown that it's a bottleneck)
1,000 - 10,000 words is a very small amount.
A HashMap will be fine.
I would recommend doing such a task in Perl/PHP. It's very hard to kill a fly with a machine gun.
A HashMap is perfect. You need to store
a copy of each word encountered
the count for each
A HashMap really won't store much more than that!
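
For example, the whole counting step is only a few lines with HashMap's merge (the file name and the whitespace tokenization are assumptions):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

public class WordCount {
    public static void main(String[] args) throws IOException {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : Files.readAllLines(Paths.get("input.txt"))) {
            for (String word : line.split("\\s+")) {
                if (!word.isEmpty()) {
                    // merge: insert 1 for a new word, otherwise add 1 to the count
                    counts.merge(word, 1, Integer::sum);
                }
            }
        }
        counts.forEach((word, n) -> System.out.println(word + ": " + n));
    }
}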
Assuming that the strings are not insanely long, a Trie approach as Michael suggests would be good. Each node in the Trie can store a character and the count of the strings that end at that node. This should drastically reduce the storage requirements (again assuming the strings are uniformly distributed and overlapping).
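A minimal sketch of such a counting trie (the node layout is one of several possibilities):

import java.util.HashMap;
import java.util.Map;

public class CountingTrie {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        int count;  // number of words ending at this node
    }

    private final Node root = new Node();

    public void add(String word) {
        Node node = root;
        for (int i = 0; i < word.length(); i++) {
            node = node.children.computeIfAbsent(word.charAt(i), c -> new Node());
        }
        node.count++;  // one more occurrence of this word
    }

    public int count(String word) {
        Node node = root;
        for (int i = 0; i < word.length(); i++) {
            node = node.children.get(word.charAt(i));
            if (node == null) {
                return 0;  // word never seen
            }
        }
        return node.count;
    }
}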
Assuming that the counts do not need to be persisted across invocations, while using a HashMap, let the Map be from Integer => Integer - where the key is the hashCode of the string and the value is the count. This should be an efficient solution - with fast lookup and a reduced memory footprint (though note that two distinct words can share a hashCode, in which case their counts would be conflated).

Find a large collection of strings within a larger collection of strings

I have a collection of strings that I want to filter. They'll be in this pattern:
xxx_xxx_xxx_xxx
so always a sequence of letters or numbers separated by three underscores. The max length of each string will be 60 characters. I might have a few million of these in my collection.
What data structure could I use to efficiently do something like this:
Get all strings starting with: "abc_123_456"
Get all strings starting with: "def_999_888"
etc..
For example, I could do this:
List<String> matched = new ArrayList<String>();
for (String it : strings) {
    if (it.startsWith(match)) {
        matched.add(it);
    }
}
but that would take a long time if my collection is on the order of millions of strings, and worse yet if the number of matched strings is also high.
The high-level problem is that I want to answer the following question for an app I'm writing: "which of my friends have recommended product A for product B?". I could store this information in an SQL table and run the following statement:
select recommender from recs where username='me' and prodIdA='a' and prodIdB='b';
I'm curious if something custom in java/C/C++ could run faster, using encoded flat strings like I have above:
myusername_prodIdA_prodIdB_recommenderusername
The idea being that you could do a starts-with operation on the whole collection of encoded strings to get your answer.
I know that a custom solution like this is most likely not usable in a production environment, so some SQL DB would be better - just curious though.
Thanks
To do that in Java, you can use a Trie structure.
That being said, I don't think it's a good idea. Dumping "a few million" records into memory won't always work.
That's what databases are for; with the right design and proper indexing you can have very good performance with the DB alone.
I think you are looking for a SortedMap.
"headMap(K toKey)
Returns a view of the portion of this map whose keys are strictly less than toKey."
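Concretely, a prefix query can be expressed as a range view on a TreeSet; the '\uffff' sentinel below is an assumption that sorts above any character that can appear in the keys:

import java.util.NavigableSet;
import java.util.TreeSet;

public class PrefixSearch {
    public static void main(String[] args) {
        NavigableSet<String> strings = new TreeSet<>();
        strings.add("abc_123_456_alice");
        strings.add("abc_123_456_bob");
        strings.add("def_999_888_carol");

        String prefix = "abc_123_456";
        // All entries >= prefix and < prefix + '\uffff', i.e. every string
        // starting with the prefix. O(log n) to locate, then linear in matches.
        NavigableSet<String> matched = strings.subSet(prefix, true, prefix + '\uffff', false);
        System.out.println(matched);  // [abc_123_456_alice, abc_123_456_bob]
    }
}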
I know that a custom solution like this is most likely not usable in a production environment, so some SQL DB would be better - just curious though
If only for the sake of curiosity, you can put all existing distinct "myusername_prodIdA_prodIdB" combinations into a hashtable, and for each combination store a list of the relevant results.
So the structure would look like Map<String, List<String>> and be used like hash.get("def_999_888"). Constant time (O(1)).
You can get rid of the inner list and optimize it in many ways, but this is the idea.
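
A sketch of that idea (the class and method names are illustrative; the underscore encoding follows the question's format):

import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class RecommendationIndex {
    private final Map<String, List<String>> index = new HashMap<>();

    // Underscore-delimited composite key, matching the question's encoding.
    private static String key(String user, String prodA, String prodB) {
        return user + "_" + prodA + "_" + prodB;
    }

    public void add(String user, String prodA, String prodB, String recommender) {
        index.computeIfAbsent(key(user, prodA, prodB), k -> new ArrayList<>())
             .add(recommender);
    }

    // O(1) lookup of every recommender for the user/product combination.
    public List<String> recommenders(String user, String prodA, String prodB) {
        return index.getOrDefault(key(user, prodA, prodB), Collections.emptyList());
    }
}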
The first thing that comes to mind for me is pre-processing the strings into some sort of data structure so that they could be searched for efficiently. If you're going to be calling the search function many times, I think it'd be good for you to put all of the strings into a hash table for a constant-time look up. It'd take more processing power to construct your array of strings, but it'd trivialize the task of searching for them.
