Collection type for String search

Collection type for String search - java

I have a String that I need to search for in a collection of Strings. I'll need to do searches for multiple representations of the required String(original representation, trimmed, UTF-8 encoded, non ASCII characters encoded). The collection size will be in the order of thousands.
I'm trying to figure out what's the best representation to use for the collection in order to have the best performance:
ArrayList - iterate over the array and check if any of the elements match any of the Strings representations
HashMap - check if map contains any of my Strings representation
Any other?

Generally speaking, HashMap (or any other hashtable-based data structure) is much more preferred for "lookup" exercise. The reason is simple, those data structures support lookup in constant time (independent of collection size).
But... in your scenario (single query for collection), you probably will not gain any performance improvements from using HashMap instead of ArrayList. Reasons:
Putting data inside HashMap will take some time. Not significant time, but comparable to one full pass of the initial list.
Your collection is pretty small - iterating over 5000 of elements is a matter of couple milliseconds (or faster?). Since you need to "search" only once, you will not save much time on that.

Related

Is it faster to hash a long string for comparisons or compare the two strings?

Let's say I have a list of very long strings (40-1000 characters). A user needs to be able to enter a term into the list and the list will report whether the term exists.
Barring storage, is it more efficient to store a hash alongside the long strings, and then when a user attempts a lookup it hashes the input and compares it to a list of hashes?
There are similar answers here, but they aren't quite generalized enough.

Assuming that the data fits in the heap (i.e., in memory), your best bet is to use a Set (or Map if there is data associated with each string). Either change your storage from a List to a Set (using HashSet) or maintain a separate Set if you also really need a List.
The time to compute the hashcode() of a string is proportional to the length of the string. The time to look for the string is constant with respect to the number of strings in the collection (once the hashcode has been computed), assuming a properly-implemented hashcode() and properly-sized Set.
If instead you use equals() on an unsorted list, your lookup time will probably be proportional to the number of items in the list. If you keep the list sorted, you could do binary search with the number of comparisons to lookup one string proportional to the log of the number of items in the list (and each comparison will have to compare characters until a difference is found).
In essence, the Set is sort of like keeping the hashcode of the strings handy, but it goes one step further and stores the data in such a way that it is very quick to jump straight to the elements of the collection that have that hashcode value.
Note that an equals comparison of two strings can bail out as soon as a difference is found, but might have to compare every character in the two strings (when they are equal). If your strings have similar, long prefixes it can hurt performance. Sometimes, you can benefit (performance-wise) from knowledge of the content of your data types. For example, if all your strings begin with the same 1K prefix and only differ in the end, you could benefit from overriding the equals() implementation to compare from the end to the start, so you find differences earlier.

Your question is not specific enough.
First, I assume you mean "I have a set of very long strings", because list is very inefficient structure for presence lookups
Some ideas:
Depends on the properties of your strings' set (i. e. the domain), prefix tree could appear to be dramatically more efficient by memory and speed, than any sort of hash table. Prefix tree means comparisons, not hash computation.
Otherwise, you should end up using some sort of hash table, which means you should compute hash code anyway, at least once for each string. In this case, it seems reasonable to store hash codes along with the strings. But for strict correctness, in the end you should probably compare strings by contents anyway, because hash collisions are possible.
Theoretically, max speed of well-distributed hash functions is 3-4 bytes / clock cycle (i. e. hash function consumes 3-4 bytes per CPU cycle).
Speed of stream comparison - depends on some conditions and how your code is compiled, there are instuctions on modern CPUs that allow to compare up to 16 bytes per cycle. Interesting, that Arrays.equals methods are intrinsified, but there is no "raw" memory comparison method in sun.misc.Unsafe class.

What hashing algorithms to consider for variable length data

To avoid any confusions I am re framing my question based on my research on hashing algorithms
Problem statement
I have multiple text files containing variable length data records. I need find if there are duplicate records in the input. Each of the text files could have data records in millions.
Since I cannot load all the data in memory at once, I plan to create a hash of the key fields in the record when it is processed. Processing a record would mean validating, filtering and transforming it. After processing all the records in all the text files, they are merged to create one view of the whole input (either a text file or a database table).
Finding duplicates would be much faster if a hash value is generated for all the records. If there are collisions of hash values, only those records could be checked for equality (assuming the hashing algorithm is deterministic)
Question - What hash algorithms should I consider for such input i.e. variable length data?

Short Answer
Don't do it. Use the Java map. You can find details here:
http://docs.oracle.com/javase/6/docs/api/java/util/Map.html
Long Answer
You can create a perfect hashing function by treating your string as a number in base-N where N is all of the possible values any character can take on. The problem here is memory. Hashing functions are meant to be used with arrays, which means you'll need an array large enough to handle the results of your hash, and that is impractical.
For instance, take a modest example of a 10 character key. Let's be even more modest and assume they are guaranteed to consist solely of lower-case letters. That gives you 26 possibilities for each character, and 10 characters. This means the possible combinations are:
26 ^ 10 = 141,167,095,653,376
If you look up hashing algorithms, one of the first things they include is collision detection because they acknowledge that collisions are a fact of life.
Now you say you are not loading keys in memory, yet why are you using a hash then? The point of a hash is to give you a mapping onto an array index. Perhaps you're better off using another mechanism.
Possible Solutions
If you are concerned about memory, get some statistics on the duplicates in your file. If you only store a flag to indicate the occurrence of a particular key in the hash, and you have many duplicates, you may be able to just use Java's map. Java's map handles collisions, so that won't keep you from detecting unique keys. You can rest assured that if A[x] is found, that means x is in A, even if x's hash collided with a previous hash.
Next, you could try some utilities to pull out duplicates. Since they would have been written specifically for the purpose, they should be able to handle a large amount of data.
Finally, you could try putting your entries into a database and using that to handle duplicates. This may seem like overkill, but databases are optimized for dealing with very large numbers of records.

This is an extension to the Map idea. Before resorting to this I would check that it cannot be done by simply building a HashSet representing all the strings at once. Remember you can use a 64-bit JVM and set a large heap size.
Define a class StringLocation that contains the data you would need to do a random access to a string on disk - for example, a reference to a RandomAcessFile and an offset within file. If you cannot keep all the files open at once, open and close as needed, caching the RandomAcessFile for the most used files.
Create a HashMap<Integer,List<StringLocation>>.
Start reading the strings. For each string, convert to lower case and obtain its hashCode(), hash, in Integer form. If there is an entry in the Map with hash as key, compare the new string to each string represented in the existing value, doing random file access to get to the already processed strings. Use the String equalsIgnoreCase. If there is a match, you have a duplicate. If there is no match, append a new StringLocation, representing the current string, to the List.
This requires at most two strings to be in memory at a time, the one you are currently processing and a previously processed string with the same hashCode() result to which you are comparing it.
You can further reduce the number of times you have to re-read a string for an equals check by using MessageDigest to generate, for the lower case string, a wide checksum with low risk of collisions, and saving it in the StringLocation object. During a comparison, return false if the checksums do not match, without re-reading the strings.

Hashcode for strings that can be converted to integer

I'm looking for the most effective way of creating hashcodes for a very specific case of strings.
I have strings that can be converted to integer, they vary from 1 to 10,000, and they are very concentrated on the 1-600 range.
My question is what is the most effective way, in terms of performance for retrieving the items from a collection to implement the hashcode for it.
What I'm thinking is:
I can have the strings converted to integer and use a direct acess table (an array of 10.000 rows) - this will be very fast for retrieving but not very smart in terms of memory allocation;
I can use the strings as strings and get a hashcode for it (i wont have to convert it to integer, but i dont know how effective will be the hashcode for the strings in terms of collisions)
Any other ideas are greatly appreciated.
thanks a lot
Thanks everyone for your promptly replies...
There is another information Tha i've forget to add on this. I tink it Will Make this clear if I let you know my final goal with this-I migh not even need a hash table!!!
I just want to validate a stream against a dictiory that is immutable. I want to check if a given tag might or might not be present on my message.
I will receive a string with several pairs tag=value. I want to verify if the tag must or must not be treated by my app.

You might want to consider a trie (http://en.wikipedia.org/wiki/Trie) or radix tree (http://en.wikipedia.org/wiki/Radix_tree). No need to parse the string into an integer, or compute a hash code. You're walking a tree as you walk the string.
Edit:
Both computing a hash code on a string and parsing an integer out of a string involve walking the entire the string, and THEN using that value as a look-up into a specific data structure. Other techniques might involve simultaneously inspecting the string WHILE traversing a data structure. This MIGHT be of value to the poster who asked for "other ideas".

Many collections (e.g. HashMap) already apply a supplemental "rehash" method to help with poor hashcode algorithms. e.g. browse the cource code for HashMap.hash(). And Strings are very common keys, so you can be sure that String.hashCode() is highly optimized. SO, unless you notice a lot of collisions between your hashCodes, I'd go with the standard code.
I tried putting the Strings for 0..600 into a HashSet to see what happened, but it's then pretty tedious to see how many entries had collisions. Look for yourself! If you really really care, copy the source code from HashMap into your own class, edit it so you can get access to the entries (in the Java 6 source code I'm looking at, that would be transient Entry[] table, YMMV), and add methods to count collisions.

If there are only a limited valid range of values, why not represent the collection as a int[10000] as you suggested? The value at array[x] is the number of times that x occurs.
If your strings are represented as decimal integers, then parsing them to strings is a 5-iteration loop (up to 5 digits) and a couple of additions and subtractions. That is, it is incredibly fast. Inserting the elements is effectively O(1), retrieval is O(1). Memory required is around 40kb (4 bytes per int).
One problem is that insertion order is not preserved. Maybe you don't care.
Maybe you could think about caching the hashcode and only updating it if your collection has changed since the last time hashcode() was called. See Caching hashes in Java collections?

«Insert disclaimer about only doing this when it's a hot spot in your application and you can prove it»
Well the integer value itself will be a perfect hash function, you will not get any collisions. However there are two problems with this approach:
HashMap doesn't allow you to specify a custom hash function. So either you'll have to implement you own HashMap or you use a wrapper object.
HashMap uses a bitwise and instead of a modulo operation to find the bucket. This obviously throws bits away since it's just a mask. java.util.HashMap.hash(int) tries to compensate for this but I have seen claims that this is not very successful. Again we're back to implementing your own HashMap.
Now that this point since you're using the integer value as a hash function why not use the integer value as a key in the HashMap instead of the string? If you really want optimize this you can write a hash map that uses int instead of Integer keys or use TIntObjectHashMap from trove.
If you're really interested in finding good hash functions I can recommend Hashing in Smalltalk, just ignore the half dozen pages where the author rants about Java (disclaimer: I know the author).

Java Memory Saving Techniques?

I have this 4 Dimensional array to store String values which are used to create a map which is then displayed on screen with the paintComponent. I have read many articles saying that using huge arrays is very inefficient. (Especially since the array dimensions are 16x16x3x3) I was wondering if there was any way to store the string values (I use them as ID values) differently to save memory or reduce fetching time. If you have any ideas or methods I would appreciate it. Thanks!

Well, if your matrix is full, that is every element contains data, then I believe an array is optimally efficient. But if your matrix is sparse, you could look into using more Linking-based data types.
the first thing I would do is to not use strings as IDs, use ints. It'll reduce the size of your structure a lot.
Also, that array really isn't that big, I wouldn't worry about efficiency if that's the only data structure you have. It's only 2304 elements large.

First off, 16*16*3*3 = 2304 - quite modest really. At this size I'd be more worried about the confusion likely to be caused by a 4D array than the size it is taking!
As others have said, if it fully populated, arrays are ok. If it has gaps, an ArrayList or similar would be better.
If the Strings are just IDs, why not store an enum (or even Integers) instead of a string?

Keep in mind that the String values are separate from the array. The array itself takes the same memory space regardless of what string values it links to. Accessing a specific address in your array will take the same amount of time regardless of what type of object you have saved there, or what the value of that object is.
However, if you find that many of your string values represent exactly the same string, you can avoid having multiple copies of the same string by leveraging String.intern(). If you store the interned string, and you don't have any other references to the non-interned string, that frees the non-interned string up to be garbage-collected. Your array will then have multiple entries that point to the same memory space, rather than different memory addresses with equivalent string objects.
See also:
Is it good practice to use java.lang.String.intern()?
http://www.codeinstructions.com/2009/01/busting-javalangstringintern-myths.html
Depending on the requirements of your IDs, you may also want to look into using a different data structure than strings. For example, while the array itself would be the same size, storing int values would avoid the need to allocate extra space for each individual entry.
Also, a 4-dimensional array may not be the best data structure for your needs in the first place. Can you describe why you've chosen this data structure for what you're trying to represent?

The Strings only take up the space of a reference in each array element. There could be a savings if the strings come from a very small set of values. A more important question is are your 4-dimensional arrays sparse or mostly filled? If you have very few values actually specified then you might have a big savings replacing the 4-d array with a Map from the indicies to the String. Let me know if you want a code sample.

Do you actually have a 4D array of 16x16x3x3 (i.e. 2k) string objects? That doesn't sound that big to me. An array is the most efficient way to store a collection of objects, in terms of memory. An ArrayList can be slightly less efficient (up to 50% wasted space).
The only other way I can think of is to store the Strings end-to-end in one giant String and then use substring() to get the bit you need from that, but you would still need to store the indexes somewhere.
Are you running out of memory? If so check that the Strings in your array are the size you think they are - the backing array of a String instance in Java can be much larger than the string itself. If you do subString() on a 1 GB string, the returned string instance shares the 1 GB array of the first string so will keep it from being GC'd longer than you might expect.

What is the general purpose of using hashtables as a collection? [duplicate]

This question already has answers here:
Closed 12 years ago.
Possible Duplicate:
What exactly are hashtables?
I understand the purpose of using hash functions to securely store passwords. I have used arrays and arraylists for class projects for sorting and searching data. What I am having trouble understanding is the practical value of hashtables for something like sorting and searching.
I got a lecture on hashtables but we never had to use them in school, so it hasn't clicked. Can someone give me a practical example of a task a hashtable is useful for that couldn't be done with a numerical array or arraylist? Also, a very simple low level example of a hash function would be helpful.

There are all sorts of collections out there. Collections are used for storing and retrieving things, so one of the most important properties of a collection is how fast these operations are. To estimate "fastness" people in computer science use big-O notation which sort of means how many individual operations you have to accomplish to invoke a certain method (be it get or set for example). So for example to get an element of an ArrayList by an index you need exactly 1 operation, this is O(1), if you have a LinkedList of length n and you need to get something from the middle, you'll have to traverse from the start of the list to the middle, taking n/2 operations, in this case get has complexity of O(n). The same comes to key-value stores as hastable. There are implementations that give you complexity of O(log n) to get a value by its key whereas hastable copes in O(1). Basically it means that getting a value from hashtable by its key is really cheap.

Basically, hashtables have similar performance characteristics (cheap lookup, cheap appending (for arrays - hashtables are unordered, adding to them is cheap partly because of this) as arrays with numerical indices, but are much more flexible in terms of what the key may be. Given a continuous chunck of memory and a fixed size per item, you can get the adress of the nth item very easily and cheaply. That's thanks to the indices being integers - you can't do that with, say, strings. At least not directly. Hashes allows reducing any object (that implements it) to a number and you're back to arrays. You still need to add checks for hash collisions and resolve them (which incurs mostly a memory overhead, since you need to store the original value), but with a halfway decent implementation, this is not much of an issue.
So you can now associate any (hashable) object with any (really any) value. This has countless uses (although I have to admit, I can't think of one that's applyable to sorting or searching). You can build caches with small overhead (because checking if the cache can help in a given case is O(1)), implement a relatively performant object system (several dynamic languages do this), you can go through a list of (id, value) pairs and accumulate the values for identical ids in any way you like, and many other things.

Very simple. Hashtables are often called "associated arrays." Arrays allow access your data by index. Hash tables allow access your data by any other identifier, e.g. name. For example
one is associated with 1
two is associated with 2
So, when you got word "one" you can find its value 1 using hastable where key is one and value is 1. Array allows only opposite mapping.

For n data elements:
Hashtables allows O(k) (usually dependent only on the hashing function) searches. This is better than O(log n) for binary searches (which follow an n log n sorting, if data is not sorted you are worse off)
However, on the flip side, the hashtables tend to take roughly 3n amount of space.

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.