Java: accessing elements in an ArrayList slower over time

I have some code, and I noticed that iterating through an ArrayList becomes drastically slower over time. The code that seems to be causing the problem is below:
public boolean isWordOfficial(String word){
return this.wordList.get(this.stringWordList.indexOf(word)).isWordOfficial();
}
Is there something about this code I don't know in terms of accessing the two arraylists?

I don't know exactly why, or by how much, your ArrayList performance is degrading, but from a quick glance at your use case, you are doing the following operations:
given a String word, look it up in stringWordList, and return the numerical index
lookup the word in wordList contained at this index and return it
This pattern of usage would better be served by a Map, where the key would be the input word, possibly corresponding to an entry in stringWordList, and the output another word, from wordList.
A map lookup would be an O(1) operation, as compared to O(N) for the lookups in a list.
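For example, a minimal sketch of that idea, assuming a Word class with an isWordOfficial() method (the class WordLookup and the field name wordMap are illustrative, not taken from your code):

import java.util.HashMap;
import java.util.List;
import java.util.Map;

class WordLookup {
    private final Map<String, Word> wordMap = new HashMap<>();

    // Build the map once; assumes the two lists are parallel (same length, same order).
    WordLookup(List<String> stringWordList, List<Word> wordList) {
        for (int i = 0; i < stringWordList.size(); i++) {
            wordMap.put(stringWordList.get(i), wordList.get(i));
        }
    }

    // Each call is an average-case O(1) map lookup instead of an O(N) indexOf scan.
    boolean isWordOfficial(String word) {
        Word w = wordMap.get(word);
        return w != null && w.isWordOfficial();
    }
}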

this.stringWordList.indexOf is O(N) and that is the cause of your issue. As N increases (you add words to the list) these operations take longer and longer.
To avoid this, keep your list sorted and use binarySearch.
This takes your complexity from O(N) to O(log N).
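A minimal sketch of that sorted-list variant; it assumes stringWordList is kept sorted and that wordList is reordered the same way, so the index returned by binarySearch is valid for both lists:

import java.util.Collections;

public boolean isWordOfficial(String word) {
    int idx = Collections.binarySearch(this.stringWordList, word); // O(log N)
    return idx >= 0 && this.wordList.get(idx).isWordOfficial();
}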

Related

How to quickly know the indexes in a massive ArrayList of a very large number of strings from this ArrayList in Java?

Suppose that I have a collection of 50 million different strings in a Java ArrayList. Let foo be a set of 40 million arbitrarily chosen (but fixed) strings from the previous collection. I want to know the index of every string in foo in the ArrayList.
An obvious way to do this would be to iterate through the whole ArrayList until we found a match for the first string in foo, then for the second one and so on. However, this solution would take an extremely long time (considering also that 50 million was an arbitrary large number that I picked for the example, the collection could be in the order of hundreds of millions or even billions but this is given from the beginning and remains constant).
I thought then of using a Hashtable of fixed size 50 million in order to determine the index of a given string in foo using someStringInFoo.hashCode(). However, from my understanding of Java's Hashtable, it seems that this will fail if there are collisions as calling hashCode() will produce the same index for two different strings.
Lastly, I thought about first sorting the ArrayList with the sort(List<T> list) in Java's Collections and then using binarySearch(List<? extends T> list,T key,Comparator<? super T> c) to obtain the index of the term. Is there a more efficient solution than this or is this as good as it gets?
You need an additional data structure that is optimized for searching strings. It will map each string to its index. The idea is that you iterate over your original list to populate the data structure, and then iterate over your set, performing lookups in that structure.
What structure should you choose?
There are three options worth considering:
Java's HashMap
TRIE
Java's IdentityHashMap
The first option is simple to implement but does not give the best possible performance. Still, its population time of O(N * R) is better than sorting the list, which is O(R * N * log N), and its search time is better than in a sorted String list (amortized O(R) compared to O(R * log N)), where R is the average length of your strings.
The second option is always good for maps of strings, providing guaranteed population time for your case of O(R * N) and guaranteed worst-case search time of O(R). Its only disadvantage is that there is no out-of-the-box implementation in the Java standard libraries.
The third option is a bit tricky and suitable only for your case. To make it work you need to ensure that the strings from the first list are literally reused in the second list (that they are the same objects). Using IdentityHashMap eliminates the cost of String equals (the R above), as IdentityHashMap compares strings by reference, which takes only O(1). Population cost is amortized O(N) and search cost amortized O(1). So this solution provides the best performance with an out-of-the-box implementation. Note, however, that it only works if there are no duplicates in the original list.
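A minimal sketch of the first option, using a plain HashMap from string to index (the names bigList, foo and indicesOf are illustrative):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// One pass over the big list to build the index, O(N * R) overall.
static List<Integer> indicesOf(List<String> bigList, List<String> foo) {
    Map<String, Integer> indexByString = new HashMap<>(bigList.size() * 2);
    for (int i = 0; i < bigList.size(); i++) {
        indexByString.putIfAbsent(bigList.get(i), i); // keep the first index on duplicates
    }
    // One map lookup per element of foo, amortized O(R) each.
    List<Integer> result = new ArrayList<>(foo.size());
    for (String s : foo) {
        result.add(indexByString.get(s)); // null if s does not occur in bigList
    }
    return result;
}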
If you have any questions please let me know.
You can use a Java Hashtable with no problems. According to the Java Documentation "in the case of a "hash collision", a single bucket stores multiple entries, which must be searched sequentially."
I think you have a misconception on how hash tables work. Hash collisions do NOT ruin the implementation. A hash table is simply an array of linked-lists. Each key goes through a hash function to determine the index in the array which the element will be placed. If a hash collision occurs, the element will be placed at the end of the linked-list at the index in the hash-table array. See link below for diagram.
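To illustrate that a collision does not break anything, here is a small sketch: "Aa" and "BB" are two distinct strings with the same hashCode() (2112), so they land in the same bucket, yet both entries remain retrievable:

import java.util.HashMap;
import java.util.Map;

public class CollisionDemo {
    public static void main(String[] args) {
        Map<String, Integer> m = new HashMap<>();
        m.put("Aa", 1);
        m.put("BB", 2);
        System.out.println("Aa".hashCode() == "BB".hashCode()); // true: same bucket
        System.out.println(m.get("Aa")); // 1 -- the collision did not overwrite anything
        System.out.println(m.get("BB")); // 2
    }
}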

find if there exists a common element in 2 arrays

Given two arrays of integers, how can you efficiently find out if the two arrays have an element in common?
Can somebody come up with a better space complexity than this? (I would also appreciate it if you pointed out errors in the program, thanks!)
Is it possible to solve this using a XOR?
public boolean findcommon(int[] arr1, int[] arr2) {
    Set<Integer> s = new HashSet<Integer>();
    for (int i = 0; i < arr1.length; i++) {
        if (!s.contains(arr1[i]))
            s.add(arr1[i]);
    }
    for (int i = 0; i < arr2.length; i++) {
        if (s.contains(arr2[i]))
            return true;
    }
    return false;
}
Since you are asking for a more space efficient solution:
When you accept a runtime of O(n log n) and are allowed to change the arrays, you could sort them and then do a linear pass to find the common elements.
If you only need to do it ONCE, then you can't do better than time complexity O(n+m), where n and m are the respective lengths of the arrays. You have to go through the input in both arrays; there is no choice (how else would you look at all the input?), so just reading the input already has that complexity, and there is no point in doing something more elaborate. If you need to keep searching as the arrays grow, that's a different discussion.
So the question with your suggested implementation comes down to: how long does "contains" take? Since you're using a HashSet, contains is constant time O(1), so you get O(n) to create the HashSet and O(m) to check the elements of the second array.
Put together, O(n+m). Good enough ;)
If you're looking for improved space complexity, you first of all need to be able to make changes to the original arrays. But I don't think there's any way to use less additional space than O(n) and still perform in O(n+m) time.
Note: I'm not sure what XOR you are thinking of. If you're thinking of bit-wise or logical XOR, there's no use for it here. If you're thinking of Set XOR, it doesn't matter if it's logically useful or not, as it's not in the implementation of Java Sets, so you would still have to roll your own.
Given that your solution only attempts to find if there is an element that exists in both arrays, the following is code that will do it for you:
public boolean findCommon(int[] arr1, int[] arr2) {
    Hashtable<Integer, String> hash = new Hashtable<Integer, String>();
    for (int item : arr1) {
        if (!hash.containsKey(item)) {
            hash.put(item, "foo"); // dummy value; only the keys matter
        }
    }
    for (int item : arr2) {
        if (hash.containsKey(item)) {
            return true;
        }
    }
    return false;
}
This still has a worst-case complexity of O(n) for two arrays that do not share a single element. If, as your initial question suggests, what you're worried about is space complexity (e.g. you'd be happy to accept a performance hit if you didn't have to store the HashTable), you could go for something along these lines:
public boolean findCommon(int[] arr1, int[] arr2) {
    for (int item : arr1) {
        for (int item2 : arr2) {
            if (item == item2) {
                return true;
            }
        }
    }
    return false;
}
That would solve your Space Complexity issue, but would have the (objectively terrible) Time Complexity of O(n^2).
This could be simplified if you were able to put more constraints around it (say you happen to know that at least one of the arrays is sorted, or better yet, both are sorted).
But for the general case you asked about, I believe it really comes down to O(n) time with a HashTable at the cost of extra space, or O(n^2) time with less space.
You can improve space usage (this is the question, right?) with an algorithm of order O(n*m). Just take every pair of elements and compare them... This is awful in time but does not use any additional memory. Otherwise you could sort the two arrays in place (if you are allowed to modify them) and then find the common elements with a single merge-style pass in O(n + m).
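A minimal sketch of that sort-then-scan variant (note that it modifies both input arrays):

import java.util.Arrays;

public boolean haveCommonElement(int[] arr1, int[] arr2) {
    Arrays.sort(arr1); // O(n log n)
    Arrays.sort(arr2); // O(m log m)
    int i = 0, j = 0;
    while (i < arr1.length && j < arr2.length) { // merge-style scan, O(n + m)
        if (arr1[i] == arr2[j]) return true;
        if (arr1[i] < arr2[j]) i++; else j++;
    }
    return false;
}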

How reduce the complexity of the searching in two lists algorithm?

I have to find some common items in two lists. I cannot sort them; order is important. I have to find how many elements from secondList occur in firstList. Right now it looks like this:
int[] firstList;
int[] secondList;
int iterator=0;
for(int i:firstList){
while(i <= secondList[iterator]/* two conditions more */){
iterator++;
//some actions
}
}
The complexity of this algorithm is n x n. I am trying to reduce the complexity of this operation, but I don't know how to compare the elements in a different way. Any advice?
EDIT:
Example: A = 5,4,3,2,3 and B = 1,2,3
We look for pairs B[i], A[j]
Condition:
when B[i] < A[j], do j++
when B[i] >= A[j], return the pair B[i], A[j-1]
and the next iteration goes through list A only up to element j-1 (i.e. for(int z=0;z<j-1;z++))
I'm not sure I've made myself clear.
Duplicates are allowed.
My approach would be - put all the elements from the first array in a HashSet and then do an iteration over the second array. This reduces the complexity to the sum of the lengths of the two arrays. It has the downside of taking additional memory, but unless you use more memory I don't think you can improve your brute force solution.
EDIT: to avoid further dispute on the matter: if you are allowed to have duplicates in the first array and you actually care how many times an element in the second array matches an element in the first one, use a multiset implementation such as Guava's HashMultiset.
Put all the items of the first list in a set
For each item of the second list, test whether it is in the set.
Solved in less than n x n!
Edit to please fge :)
Instead of a set, you can use a map with the item as key and the number of occurrences as value.
Then for each item of the second list, if it exists in the map, execute your action once per occurrence in the first list (the map entry's value).
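A minimal sketch of that count-map idea with a plain HashMap (no extra library); countMatches is an illustrative name:

import java.util.HashMap;
import java.util.Map;

public int countMatches(int[] firstList, int[] secondList) {
    // Value = how many times the key occurs in firstList.
    Map<Integer, Integer> counts = new HashMap<>();
    for (int x : firstList) {
        counts.merge(x, 1, Integer::sum);
    }
    int matches = 0;
    for (int y : secondList) {
        matches += counts.getOrDefault(y, 0); // 0 if y never occurs in firstList
    }
    return matches;
}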
import java.util.HashSet;
import java.util.Set;

int[] firstList;
int[] secondList;

// Box the first array into a set (Arrays.asList does not work element-wise on an int[]).
Set<Integer> hs = new HashSet<Integer>();
for (int x : firstList) {
    hs.add(x);
}
Set<Integer> result = new HashSet<Integer>();
for (int iterator = 0; iterator < secondList.length; iterator++) {
    if (hs.contains(secondList[iterator])) {
        result.add(secondList[iterator]);
    }
}
result will contain the required common elements.
Algorithm complexity: O(n).
Just because the order is important doesn't mean that you cannot sort either list (or both). It only means you will have to copy first before you can sort anything. Of course, copying requires additional memory and sorting requires additional processing time... yet I guess all solutions that are better than O(n^2) will require additional memory and processing time (also true for the suggested HashSet solutions - adding all values to a HashSet costs additional memory and processing time).
Sorting both lists is possible in O(n * log n) time, and finding the common elements once the lists are sorted is possible in O(n) time. Whether it will be faster than your naive O(n^2) approach depends on the size of the lists. In the end only testing the different approaches can tell you which one is fastest (and those tests should use realistic list sizes, as to be expected in your final code).
Big-O notation does not tell you anything about absolute speed, only about relative speed. E.g. if you have two algorithms to calculate a value from an input set of elements, one O(1) and the other O(n), this doesn't mean that the O(1) solution is always faster. This is a big misconception about Big-O notation! It only means that if the number of input elements doubles, the O(1) solution will still take approximately the same amount of time, while the O(n) solution will take approximately twice as much time as before. So there is no doubt that by constantly increasing the number of input elements, there must be a point where the O(1) solution becomes faster than the O(n) solution, yet for a very small set of elements, the O(1) solution may in fact be slower than the O(n) solution.
OK, so this solution will work only if there are no duplicates in either the first or the second array. As the question does not say, we cannot be sure.
First, build a LinkedHashSet<Integer> out of the first array, and a HashSet<Integer> out of the second array.
Second, retain in the first set only elements that are in the second set.
Third, iterate over the first set and proceed:
// A LinkedHashSet retains insertion order
// (this assumes firstArray and secondArray are Integer[], so Arrays.asList boxes correctly)
Set<Integer> first = new LinkedHashSet<Integer>(Arrays.asList(firstArray));
// A HashSet does not, but we don't care
Set<Integer> second = new HashSet<Integer>(Arrays.asList(secondArray));
// Retain in first only what is also in second
first.retainAll(second);
// Iterate
for (int i : first)
    doSomething();

Java Search an array for a matching string

how can I optimize the following:
final String[] longStringArray = {"1","2","3".....,"9999999"};
String searchingFor = "9999998";
for(String s : longStringArray)
{
if(searchingFor.equals(s))
{
//After 9999998 iterations finally found it
// Do the rest of stuff here (not relevant to the string/array)
}
}
NOTE: The longStringArray is only searched once per run, is not sorted, and is different from one run to the next.
I'm sure there is a way to improve the worst-case performance here, but I can't seem to find it...
P.S. I would also appreciate a solution for the case where the string searchingFor does not exist in the array longStringArray.
Thank you.
Well, if you have to use an array, and you don't know if it's sorted, and you're only going to do one lookup, it's always going to be an O(N) operation. There's nothing you can do about that, because any optimization step would be at least O(N) to start with - e.g. populating a set or sorting the array.
Other options though:
If the array is sorted, you could perform a binary search. This will turn each lookup into an O(log N) operation.
If you're going to do more than one search, consider using a HashSet<String>. This will turn each lookup into an O(1) operation (assuming few collisions).
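For the repeated-search case, a minimal sketch (the class name StringIndex is illustrative):

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class StringIndex {
    private final Set<String> lookup;

    // One O(N) pass up front to build the set...
    StringIndex(String[] longStringArray) {
        this.lookup = new HashSet<>(Arrays.asList(longStringArray));
    }

    // ...then every search is O(1) on average, whether or not the string exists.
    boolean contains(String searchingFor) {
        return lookup.contains(searchingFor);
    }
}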
import org.apache.commons.lang.ArrayUtils;
ArrayUtils.indexOf(array, string);
ArrayUtils documentation
You can create a second array with the hash codes of the strings and binary search on that.
You will have to sort the hash array and move the elements of the original array accordingly. This way you will end up with extremely fast searching capabilities but it's going to be kept ordered, so inserting new elements takes resources.
The optimal approach would be to implement a binary search tree or a B-tree; if you really have that much data and you have to handle inserts, it's worth it.
Arrays.asList(longStringArray).contains(searchingFor)

Which is the appropriate data structure?

I need a Java data structure that has:
fast (O(1)) insertion
fast removal
fast (O(1)) max() function
What's the best data structure to use?
HashMap would almost work, but using java.util.Collections.max() is at least O(n) in the size of the map. TreeMap's insertion and removal are too slow.
Any thoughts?
O(1) insertion and O(1) max() are mutually exclusive once you also require fast removal.
An O(1)-insertion collection won't have O(1) max, as the collection is unsorted. An O(1)-max collection has to be sorted, so insertion is O(n). You'll have to bite the bullet and choose between the two. In both cases, however, the removal should be equally fast.
If you can live with slow removal, you could have a variable saving the current highest element, compare on insert with that variable, max and insert should be O(1) then. Removal will be O(n) then though, as you have to find a new highest element in the cases where the removed element was the highest.
If you can have O(log n) insertion and removal, you can have O(1) max value with a TreeSet or a PriorityQueue. O(log n) is pretty good for most applications.
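A minimal sketch with a TreeSet (add and remove are O(log n), and last() returns the maximum without a full scan; note that a TreeSet rejects duplicates, so duplicate values would need a TreeMap of counts instead):

import java.util.TreeSet;

public class MaxDemo {
    public static void main(String[] args) {
        TreeSet<Integer> set = new TreeSet<>();
        set.add(42); // O(log n)
        set.add(7);
        set.add(99);
        System.out.println(set.last()); // 99 -- the current maximum
        set.remove(99);                 // O(log n)
        System.out.println(set.last()); // 42
    }
}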
If you accept that O(log n) is still "fast" even though it isn't "fast (O(1))", then some kinds of heap-based priority queue will do it. See the comparison table for different heaps you might use.
Note that Java's library PriorityQueue isn't very exciting, it only guarantees O(n) remove(Object).
For heap-based queues "remove" can be implemented as "decreaseKey" followed by "removeMin", provided that you reserve a "negative infinity" value for the purpose. And since it's the max you want, invert all mentions of "min" to "max" and "decrease" to "increase" when reading the article...
you cannot have O(1) removal+insertion+max
proof:
assume you could, and let's call this data structure D
given an array A:
1. insert all elements in A to D.
2. create empty linked list L
3. while D is not empty:
3.1. x<-D.max(); D.delete(x); --all is O(1) - assumption
3.2 L.insert_first(x) -- O(1)
4. return L
Here we have created a sorting algorithm that is O(n), but that is proven to be impossible: comparison-based sorting is known to be Ω(n log n). Contradiction! Thus, D cannot exist.
I'm very skeptical that TreeMap's log(n) insertion and deletion are too slow: log(n) time is practically constant with respect to most real applications. Even with 1,000,000,000 elements in your tree, if it's balanced well you will only perform log(2, 1000000000) = ~30 comparisons per insertion or removal, which is comparable to the constant-factor work of a hash-based lookup.
Such a data structure would be awesome and, as far as I know, doesn't exist. Others pointed this.
But you can go beyond, if you don't care making all of this a bit more complex.
If you can "waste" some memory and some programming efforts, you can use, at the same time, different data structures, combining the pro's of each one.
For example, I needed a sorted data structure but wanted O(1) lookups ("is the element X in the collection?"), not O(log n). I combined a TreeMap with a HashMap (which is not really O(1), but close to it when it's not too full and the hashing function is good) and I got really good results.
For your specific case, I would go for a dynamic combination of a HashMap and a custom helper data structure. I have in mind something very complex (hash map + variable-length priority queue), but I'll go for a simple example. Just keep all the elements in the HashMap, and then use a special field (currentMax) that only contains the max element in the map. When you insert() into your combined data structure, if the element you're going to insert is greater than the current max, then you do currentMax <- elementGoingToInsert (and you insert it into the HashMap).
When you remove an element from your combined data structure, you check if it is equal to the currentMax and if it is, you remove it from the map (that's normal) and you have to find the new max (in O(n)). So you do currentMax <- findMaxInCollection().
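A minimal sketch of that combination, here with a HashSet of integers and a tracked maximum (the class name MaxTrackingSet is illustrative; the same idea applies to a HashMap):

import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

class MaxTrackingSet {
    private final Set<Integer> elements = new HashSet<>();
    private Integer currentMax = null;

    // O(1): update currentMax on the way in.
    void insert(int x) {
        elements.add(x);
        if (currentMax == null || x > currentMax) currentMax = x;
    }

    // O(1) usually; O(n) only when the removed element was the max,
    // because a new max then has to be found by scanning.
    void remove(int x) {
        if (elements.remove(x) && currentMax != null && x == currentMax) {
            currentMax = elements.isEmpty() ? null : Collections.max(elements);
        }
    }

    // O(1)
    Integer max() {
        return currentMax;
    }
}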
If the max doesn't change very frequently, that's damn good, believe me.
However, don't take anything for granted. You have to struggle a bit to find the best combination between different data structures. Do your tests, learn how frequently max changes. Data structures aren't easy, and you can make a difference if you really work combining them instead of finding a magic one, that doesn't exist. :)
Cheers
Here's a degenerate answer. I noted that you hadn't specified what you consider "fast" for deletion; if O(n) is fast then the following will work. Make a class that wraps a HashSet; maintain a reference to the maximum element upon insertion. This gives the two constant time operations. For deletion, if the element you deleted is the maximum, you have to iterate through the set to find the maximum of the remaining elements.
This may sound like it's a silly answer, but in some practical situations (a generalization of) this idea could actually be useful. For example, you can still maintain the five highest values in constant time upon insertion, and whenever you delete an element that happens to occur in that set you remove it from your list-of-five, turning it into a list-of-four etcetera; when you add an element that falls in that range, you can extend it back to five. If you typically add elements much more frequently than you delete them, then it may be very rare that you need to provide a maximum when your list-of-maxima is empty, and you can restore the list of five highest elements in linear time in that case.
As already explained: for the general case, no. However, if your range of values are limited, you can use a counting sort-like algorithm to get O(1) insertion, and on top of that a linked list for moving the max pointer, thus achieving O(1) max and removal.
