I have a list of strings. I have a set of numbers:
{1, 2, 3, 4}
and I need to generate all combinations(?) of strings to check against my list. Combinations:
(1, 2, 3, 4), (1234), (1, 2, 3, 4), (123, 4), (12, 34), (1, 2, 34), (1, 234), (1, 23, 4), (1, 23), (1, 2, 3), (1 2), ((1 2), (3 4))...etc.
This problem grows larger as my set of numbers gets larger. Is it right that this is a bad problem to use recursion for? (That is what I have now.) On the other hand, aren't the space requirements stricter for an iterative solution, e.g. the maximum size of the lists?
At termination, I need to look at the number of matches each result has with my list, and then the number of component parts in each result, e.g. (1) = 1; (1, 2) = 2.
My computer was running out of memory (this is an abstraction of a problem with larger objects)
EDIT: So my question was in a significantly larger context, such as graphics, comparing pixels in a 700 x 500 matrix... My way cannot be the most efficient way to do this, can it? I need to know the nested structure of the objects and how many pre-existing components (ones that are in my list of strings) they are made of.
EDIT2: The full problem is described here.
If this is the only way to solve your problem (generating all combinations) then it's going to be slow, though it doesn't necessarily have to take up a lot of memory.
When doing recursion, you'll want to use tail-recursion to optimize memory usage. Or just switch over to an iterative approach.
If you need to save the combinations that match, make sure you only save the combination, not copies of the objects themselves.
As a last resort you could always append each matching combination to a file to read in later, then you wouldn't be using much memory at all.
All these things could help with your memory problems.
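As a rough sketch of the last point, you can stream each matching combination straight to a file instead of accumulating it in a list. The file name, the List&lt;String&gt; shape of a combination, and the containsAll check below are placeholders for whatever your real matching test is:

import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;
import java.util.List;
import java.util.Set;

// Sketch: write a combination out the moment it matches, then let it be garbage-collected.
static void saveIfMatch(BufferedWriter out, List<String> combination, Set<String> dictionary)
        throws IOException {
    if (dictionary.containsAll(combination)) {      // stand-in for your real matching test
        out.write(String.join(",", combination));   // one matching combination per line
        out.newLine();
    }
}

You would open the writer once, e.g. try (BufferedWriter out = new BufferedWriter(new FileWriter("combos.txt", true))) { ... }, pass it into your generator, and close it when the run finishes.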
If you're getting a stack overflow then it is indeed related to using a recursive routine. This can be alleviated either by increasing the stack depth (see this SO question: What is the maximum depth of the java call stack?) or by using an iterative method.
However, if you simply ran out of memory then you will need to store the solution to disk as you go, or figure out a way to store the combinations in a more compact data structure.
As a general rule: do not use recursion when iteration is sufficient.
Recursion uses a stack, and when that stack is full you get a stack overflow. If you're performing something with a factorial expansion, the stack will seldom be big enough for a recursive solution to be viable.
If you can accomplish something with a loop or a nested loop, use that instead.
Furthermore, if you're simply checking each combination against something, you do not need to store the results of all combinations (which would take up enormous memory); instead, compare against each combination and then throw that generated combination away.
My computer was running out of memory (this is an abstraction of a problem with larger objects)
If your program is running out of stack memory, then you will be getting a StackOverflowError. If you see that, it indicates that your recursive algorithm is either the wrong approach ... or you have a bug in it. (Probably the latter. Your maximum depth of recursion should be the base set size N if your algorithm is correct.)
If your program is running out of heap memory, then you will be getting an OutOfMemoryError. If that is what you see, then the chances are that the problem is that you don't have enough heap memory to hold the set of combinations that you are generating.
I don't think we have enough information to tell you the best way to solve this. If N is going to be large, then it may be that your entire computational approach is intractable.
As part of my programming course I was given an exercise to implement my own String collection. I was planning on using the ArrayList collection or similar, but one of the constraints is that we are not allowed to use any Java API to implement it, so only arrays are allowed. I could have implemented this using plain arrays; however, efficiency is very important, as is the amount of data that this code will be tested with. I was advised to use hash tables or ordered trees as they are more efficient than arrays. After doing some research I decided to go with hash tables because they seemed easy to understand and implement, but once I started writing code I realised it is not as straightforward as I thought.
So here are the problems I have come up with and would like some advice on the best approach to solve them, again with efficiency in mind:
ACTUAL SIZE: If I understood it correctly, hash tables are not ordered (indexed), so there are going to be gaps between items because the hash function gives different indices. So how do I know when the array is full and I need to resize it?
RESIZE: One of the difficulties is that I need to create a dynamic data structure using arrays. So if I have an array String[100], once it gets full I will need to resize it by some factor (I decided to increase it by 100 each time). Once I do that, I will need to change the positions of all existing values, since their hash keys will be different, as the key is calculated with:
int position = "orange".hashCode() % currentArraySize;
So if I try to find a certain value, its hash key will be different from what it was when the array was smaller.
HASH FUNCTION: I was also wondering whether the built-in hashCode() method of the String class is efficient and suitable for what I am trying to implement, or whether it is better to create my own.
DEALING WITH MULTIPLE OCCURRENCES: One of the requirements is to be able to add multiple words that are the same, because I need to be able to count how many times a word is stored in my collection. Since they are going to have the same hash code, I was planning to add the next occurrence at the next index, hoping that there will be a gap. I don't know if it is the best solution, but here is how I implemented it:
public int count(String word) {
    int count = 0;
    while (collection[(word.hashCode() % size) + count] != null
            && collection[(word.hashCode() % size) + count].equals(word))
        count++;
    return count;
}
Thank you in advance for your advice. Please ask if anything needs to be clarified.
P.S. The length of words is not fixed and varies greatly.
UPDATE: Thank you for your advice. I know I made a few stupid mistakes there; I will try to do better. I took all your suggestions and quickly came up with the following structure. It is not elegant, but I hope it is roughly what you meant. I did have to make a few judgement calls, such as the bucket size; for now I use half the number of elements, but is there a way to calculate it, or some generally accepted value? Another uncertainty was what factor to increase my array by: should I multiply by some number n, or is adding a fixed amount also acceptable? I was also wondering about general efficiency, because I am creating instances of classes, but String is a class too, so I am guessing the difference in performance should not be too big?
ACTUAL SIZE: The built-in Java HashMap just resizes when the total number of elements exceeds the number of buckets multiplied by a number called the load factor, which is by default 0.75. It does not take into account how many buckets are actually full. You don't have to, either.
RESIZE: Yes, you'll have to rehash everything when the table is resized, which includes recomputing each element's position from its hash.
So if I try to find a certain value, its hash key will be different from what it was when the array was smaller.
Yup.
HASH FUNCTION: Yes, you should use the built-in hashCode() method. It's good enough for basic purposes.
DEALING WITH MULTIPLE OCCURRENCES: This is complicated. One simple solution would just be to have the hash entry for a given string also keep count of how many occurrences of that string are present. That is, instead of keeping multiple copies of the same string in your hash table, keep an int along with each String counting its occurrences.
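A sketch of that idea, using the same open-addressing style as the question (the Entry class and the field names are illustrative, not the asker's actual code):

// Each distinct word gets one slot; duplicates just bump the counter.
class Entry {
    final String word;
    int count = 1;                                          // occurrences of this word
    Entry(String word) { this.word = word; }
}

void add(Entry[] table, String word) {
    // assumes the table is resized before it completely fills up
    int i = (word.hashCode() & 0x7fffffff) % table.length;  // non-negative bucket index
    while (table[i] != null && !table[i].word.equals(word))
        i = (i + 1) % table.length;                         // linear probing past other words
    if (table[i] == null) table[i] = new Entry(word);
    else table[i].count++;                                  // duplicate: increment the count
}

count(word) then becomes a single lookup that returns the stored counter (or 0) instead of scanning consecutive slots.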
So how do I know when the array is full and I need to resize it?
You keep track of the size yourself, just as HashMap does. When the size used > capacity * load factor, you grow the underlying array, either as a whole or in part.
int position = "orange".hashCode() % currentArraySize;
Some things to consider.
The % of a negative value is a negative value.
Math.abs can return a negative value (Math.abs(Integer.MIN_VALUE) is still Integer.MIN_VALUE).
Using & with a bit mask is faster; however, you need a size that is a power of 2.
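Putting those points together, two common ways to turn a hashCode into a valid bucket index (variable names follow the question, as a sketch):

// Mask off the sign bit before taking the remainder, so the result is never negative.
int position = ("orange".hashCode() & 0x7fffffff) % currentArraySize;

// Or, if the array size is kept at a power of 2, a bit mask does the same job faster.
position = "orange".hashCode() & (currentArraySize - 1);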
I was also wondering if the built-in hashCode() method in the String class is efficient and suitable for what I am trying to implement, or if it is better to create my own.
The built-in hashCode is cached, so it is fast. However, it is not a great hashCode: it has poor randomness in the lower bits, and in the higher bits for short Strings. You might want to implement your own hashing strategy, possibly a 64-bit one.
DEALING WITH MULTIPLE OCCURRENCES:
This is usually done with a counter for each key. This way you can have, say, 32767 duplicates (if you use a short) or 2 billion (if you use an int) of the same key/element.
I've found many resources online, discussing this and related topics, but I haven't found anything that really helps me know where to start with implementing this solution.
To clarify, starting from city 0, I need to visit every other city once, and return back to city 0.
I have an array such as this one:
0 1129 1417 1240 1951
1129 0 1100 800 2237
1417 1100 0 1890 3046
1240 800 1890 0 1558
1951 2237 3046 1558 0
Along with finding the optimal route, I need to also find the optimal partial routes along the way. For example, I'd start with routes of length 2, and end up printing out something like this:
S = {0,1}
C({0,1},1) = 1129
S = {0,2}
C({0,2},2) = 1417
S = {0,3}
C({0,3},3) = 1240
S = {0,4}
C({0,4},4) = 1951
Then I'd go to routes of length 3, and print something like this:
S = {0,1,2}
C({0,1,2},1) = 2517
C({0,1,2},2) = 2229
and so on...
To make this a dynamic programming solution, I assume I should be saving the shortest distance between any n nodes, and the best way I've thought of to do that is with a HashMap, where the key would be an integer value built from every node included in that path, in ascending order (a path going through nodes 0>1>3>4 or 0>1>4>3 could be stored as '134'), and each key would hold a pair storing the path order as a List and the total distance as an integer.
At this point I would think I'd want to calculate all paths of distance 2, then all of distance 3, and then take the smallest few and use the hashmap to find the shortest path back for each, and compare.
Does this seem like it could work? Or am I completely on the wrong track?
You're sort of on track. Dynamic programming isn't the way to calculate a TSP. What you're sort of close to is calculating a minimum spanning tree. This is a tree that connects all nodes using the shortest possible sum of edges. There are two algorithms that are frequently used: Prim's and Kruskal's. They produce something similar to your optimal partial routes list. I'd recommend you look at Prim's algorithm: https://en.wikipedia.org/wiki/Prim%27s_algorithm
The easiest way of solving TSP is by finding the minimum spanning tree and then doing a pre-order tree walk over the tree. This gives you an approximate travelling salesman solution and is known as the triangle-inequality approximation. It's guaranteed to be no more than twice as long as an optimal tour, but it can be calculated much faster. This web page explains it fairly well: http://www.personal.kent.edu/~rmuhamma/Algorithms/MyAlgorithms/AproxAlgor/TSP/tsp.htm
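A rough sketch of that 2-approximation on a distance matrix like the one above (the method name, the use of city 0 as the root, and the int[][] dist parameter are assumptions for illustration):

import java.util.*;

static List<Integer> approxTour(int[][] dist) {
    int n = dist.length;
    int[] parent = new int[n];
    int[] best = new int[n];
    boolean[] inTree = new boolean[n];
    Arrays.fill(best, Integer.MAX_VALUE);
    best[0] = 0;
    parent[0] = -1;

    for (int i = 0; i < n; i++) {                  // Prim's: repeatedly pull in the cheapest outside city
        int u = -1;
        for (int v = 0; v < n; v++)
            if (!inTree[v] && (u == -1 || best[v] < best[u])) u = v;
        inTree[u] = true;
        for (int v = 0; v < n; v++)
            if (!inTree[v] && dist[u][v] < best[v]) { best[v] = dist[u][v]; parent[v] = u; }
    }

    List<List<Integer>> children = new ArrayList<>();
    for (int i = 0; i < n; i++) children.add(new ArrayList<>());
    for (int v = 1; v < n; v++) children.get(parent[v]).add(v);

    List<Integer> tour = new ArrayList<>();        // pre-order walk of the MST, starting from city 0
    Deque<Integer> stack = new ArrayDeque<>();
    stack.push(0);
    while (!stack.isEmpty()) {
        int u = stack.pop();
        tour.add(u);
        List<Integer> kids = children.get(u);
        for (int i = kids.size() - 1; i >= 0; i--) stack.push(kids.get(i));
    }
    tour.add(0);                                   // close the tour back at the start
    return tour;
}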
If you want a more optimal solution, you'll need to look at Christofides' algorithm, which is more complicated.
You are on the right track.
I think you're getting at the fact that the DP recursion itself only tells you the optimal cost for each S and j, not the node that attained that cost. The classical way to handle this is through "backtracking": once you have found that, for example, C({0,1,2,3,4},-) = 10000, you figure out which node attained that cost; let's say it's 3. Then you figure out which node attained C({0,1,2,3,4},3); let's say it's 1. Then you figure out which node attained C({0,1,2,4},1); let's say it's 2 ... and so on. This is the classical way, and it avoids having to store a lot of intermediate data. For DPs with a small state space, it's easier just to store all those optimizers along the way. In your case you have an exponentially large state space, so it might be expensive to store all of those optimizers, but you already have to store an equally large data structure (C), so most likely you can store the optimizers as well. Your intermediate approach of storing only a few of the best could work (if the backtracking routine turns out to need one you didn't store, you can just recalculate it the classical way); it seems reasonable, but I'm not sure it's worth the extra coding versus a pure backtracking approach.
One clarification: You don't actually want to "calculate all paths of distance 2, then all of distance 3, ...", but rather you want to enumerate all sets of size 2, then all sets of size 3, etc. Still exponential, but not as bad as enumerating all paths.
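To make that concrete, here is a minimal Held-Karp sketch over a matrix like the one in the question, using bitmasks of {1..n-1} in place of the HashMap keys you described (the dist parameter and method name are just for illustration):

import java.util.Arrays;

static int heldKarp(int[][] dist) {
    int n = dist.length;
    int full = 1 << (n - 1);                        // bitmask subsets of cities 1..n-1
    int[][] C = new int[full][n];                   // C[mask][j] = cheapest way to visit mask, ending at j
    for (int[] row : C) Arrays.fill(row, Integer.MAX_VALUE / 2);

    for (int j = 1; j < n; j++)
        C[1 << (j - 1)][j] = dist[0][j];            // base case: go straight from city 0 to j

    for (int mask = 1; mask < full; mask++) {       // masks in increasing numeric order, so C[prev][k] is final
        for (int j = 1; j < n; j++) {
            if ((mask & (1 << (j - 1))) == 0) continue;
            int prev = mask ^ (1 << (j - 1));
            if (prev == 0) continue;                // handled by the base case
            for (int k = 1; k < n; k++) {
                if ((prev & (1 << (k - 1))) == 0) continue;
                C[mask][j] = Math.min(C[mask][j], C[prev][k] + dist[k][j]);
            }
        }
    }

    int best = Integer.MAX_VALUE;
    for (int j = 1; j < n; j++)
        best = Math.min(best, C[full - 1][j] + dist[j][0]);   // close the tour back to city 0
    return best;
}

Backtracking through C, as described above, then recovers the tour itself.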
I made a median filter algorithm and I want to optimize it. Currently it takes around 1 second to filter 2 million lines (a file read into an ArrayList of elements), and I am trying to reduce that, maybe to half the time. I'm using ArrayLists for my algorithm and minimized the use of nested loops to avoid an increase in time, but I still can't get below about 0.98 seconds.
Here's a code snippet that does the median filter:
//Start Filter Algorithm 2
int index = 0;
while (index < filterSize) {
    tempElements.add(this.elements.get(index + counter)); //Add element to a temporary arraylist
    index += 1;
    if (index == filterSize) {
        outputElements.add(tempElements.get((filterSize - 1) / 2)); //Add median Value to output ArrayList
        tempElements.clear(); //Clear temporary ArrayList
        index = 0;            //Reset index
        counter += 1;         //Counter increments by 1 to move to start on next element in elements ArrayList
    }
    if (elementsSize - counter < filterSize) {
        break; //Break if there is not enough elements for the filtering to work
    }
}
What's happening is that I'm looping through the elements ArrayList for the filterSize I provided. I add the elements to a temporary (tempElements) ArrayList, sort it using Collections.sort() (this is what I want to avoid), find the median value and add it to my final output ArrayList. Then I clear the tempElements ArrayList and keep going through my loop until I cannot filter anymore due to a lack of elements (fewer than filterSize).
I'm just looking for a way to optimize it and get it faster. I tried to use a TreeSet but I cannot get the value at an index from it.
Thanks
The Java Collections.sort() implementation is about as fast as general-purpose sorting gets (it uses TimSort, a tuned merge sort; the dual-pivot quicksort is used for arrays of primitives).
The problem here is not in the nitty-gritty details but the fact that you are sorting at all! You only need to find the median, and there are linear-time algorithms for that (sorting is log-linear). See selection algorithms for some inspiration. You might need to code it yourself, since I don't think the Java library has any public implementation available.
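As a sketch of what such a selection routine can look like, here is quickselect, which finds the k-th smallest element in expected linear time, so the window's median is select(window, (filterSize - 1) / 2), matching the index your code uses. The names and the double[] element type are assumptions:

import java.util.concurrent.ThreadLocalRandom;

static double select(double[] a, int k) {
    int lo = 0, hi = a.length - 1;
    while (lo < hi) {
        // partition around a random pivot to avoid quadratic behaviour on already-sorted windows
        int p = partition(a, lo, hi, ThreadLocalRandom.current().nextInt(lo, hi + 1));
        if (k == p) return a[k];
        if (k < p) hi = p - 1; else lo = p + 1;
    }
    return a[k];
}

static int partition(double[] a, int lo, int hi, int pivotIndex) {
    double pivot = a[pivotIndex];
    swap(a, pivotIndex, hi);                 // move the pivot out of the way
    int store = lo;
    for (int i = lo; i < hi; i++)
        if (a[i] < pivot) swap(a, i, store++);
    swap(a, store, hi);                      // put the pivot in its final position
    return store;
}

static void swap(double[] a, int i, int j) { double t = a[i]; a[i] = a[j]; a[j] = t; }

Note that quickselect reorders the array it is given, so run it on a scratch copy of each window.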
The other thing I suggest is to use a fixed size array (created once) instead of an ArrayList. Since you know the size of the filter beforehand that will give you a small speed boost.
Also I don't see how avoiding for loops helps performance in any way. Unless you profiled it and proved that it's the right thing to do, I would just write the most readable code possible.
Finally, TreeSet or any other kind of sorted data structure won't help either because the time complexity is log-linear for n insertions.
As an alternative to Giovanni Botta's excellent answer:
Suppose that you have an array [7, 3, 8, 4, 6, 6, 2, 4, 6] and a filterSize of 4. Then our first temp array will be [7, 3, 8, 4] and we can sort it to get [3, 4, 7, 8]. When we compute our next temporary array, we can do it in linear (or better?) time as follows:
remove 7
insert 6 in sorted position
We can repeat this for all temp arrays after the initial sort. If you're spending a lot of time sorting subarrays, this might not be a bad way to go. The trick is that it increases required storage since you need to remember the order in which to remove the entries, but that shouldn't be a big deal (I wouldn't think).
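A sketch of that sliding-window idea in code (the List&lt;Double&gt; element type and the names are assumptions; the point is the binary-search remove/insert instead of a full re-sort per window):

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

static List<Double> medianFilter(List<Double> elements, int filterSize) {
    List<Double> output = new ArrayList<>();
    List<Double> window = new ArrayList<>(elements.subList(0, filterSize));
    Collections.sort(window);                                       // sort only the first window
    output.add(window.get((filterSize - 1) / 2));

    for (int i = filterSize; i < elements.size(); i++) {
        Double outgoing = elements.get(i - filterSize);
        Double incoming = elements.get(i);
        window.remove(Collections.binarySearch(window, outgoing));  // drop one copy of the oldest value
        int pos = Collections.binarySearch(window, incoming);
        window.add(pos >= 0 ? pos : -pos - 1, incoming);            // insert the new value at its sorted slot
        output.add(window.get((filterSize - 1) / 2));
    }
    return output;
}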
I was looking at the source code of the sort() method of the java.util.ArrayList on grepcode. They seem to use insertion sort on small arrays (of size < 7) and merge sort on large arrays. I was just wondering if that makes a lot of difference given that they use insertion sort only for arrays of size < 7. The difference in running time will be hardly noticeable on modern machines.
I have read this in Cormen:
Although merge sort runs in O(n log n) worst-case time and insertion sort runs in O(n^2) worst-case time, the constant factors in insertion sort can make it faster in practice for small problem sizes on many machines. Thus, it makes sense to coarsen the leaves of the recursion by using insertion sort within merge sort when subproblems become sufficiently small.
If I had been designing a sorting algorithm for some component I required, I would consider using insertion sort for larger sizes (maybe up to size < 100) before the difference in running time, compared to merge sort, becomes evident.
My question is what is the analysis behind arriving at size < 7?
The difference in running time will be hardly noticeable on modern machines.
How long it takes to sort small arrays becomes very important when you realize that the overall sorting algorithm is recursive, and the small array sort is effectively the base case of that recursion.
I don't have any inside info on how the number seven got chosen. However, I'd be very surprised if that wasn't done as the result of benchmarking the competing algorithms on small arrays, and choosing the optimal algorithm and threshold based on that.
P.S. It is worth pointing out that Java 7 uses Timsort by default.
I am posting this for people who visit this thread in future and documenting my own research. I stumbled across this excellent link in my quest to find the answer to the mystery of choosing 7:
Tim Peters’s description of the algorithm
You should read the section titled "Computing minrun".
To give the gist, minrun is the cutoff size of the array below which the algorithm should start using insertion sort. Hence, we will always have sorted runs of size "minrun" on which we will need to run the merge operation to sort the entire array.
In java.util.ArrayList.sort(), "minrun" is chosen to be 7, but as far as my understanding of the above document goes, it busts that myth and shows that it should be near powers of 2 and less than 256 and more than 8. Quoting from the document:
At 256 the data-movement cost in binary insertion sort clearly hurt, and at 8 the increase in the number of function calls clearly hurt. Picking some power of 2 is important here, so that the merges end up perfectly balanced (see next section).
The point which I am making is that "minrun" can be any power of 2 (or near power of 2) less than 64, without hindering the performance of TimSort.
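For reference, the rule from the "Computing minrun" section is simple to sketch: take the six most significant bits of n, and add 1 if any of the remaining bits are set. The cutoff of 64 below follows Tim Peters' write-up; the JDK's TimSort uses the same shape of loop with a smaller constant.

static int minRunLength(int n) {
    int r = 0;                 // becomes 1 if any bit shifted off was set
    while (n >= 64) {
        r |= (n & 1);
        n >>= 1;
    }
    return n + r;              // for large n the result lies in the range 32..64
}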
http://en.wikipedia.org/wiki/Timsort
"Timsort is a hybrid sorting algorithm, derived from merge sort and insertion sort, designed to perform well on many kinds of real-world data... The algorithm finds subsets of the data that are already ordered, and uses the subsets to sort the data more efficiently. This is done by merging an identified subset, called a run, with existing runs until certain criteria are fulfilled."
About the number 7:
"... Also, it is seen that galloping is beneficial only when the initial element is not one of the first seven elements of the other run. This also results in MIN_GALLOP being set to 7. To avoid the drawbacks of galloping mode, the merging functions adjust the value of min-gallop. If the element is from the array currently under consideration (that is, the array which has been returning the elements consecutively for a while), the value of min-gallop is reduced by one. Otherwise, the value is incremented by one, thus discouraging entry back to galloping mode. When this is done, in the case of random data, the value of min-gallop becomes so large, that the entry back to galloping mode never takes place.
In the case where merge-hi is used (that is, merging is done right-to-left), galloping needs to start from the right end of the data, that is, the last element. Galloping from the beginning also gives the required results, but makes more comparisons than required. Thus, the algorithm for galloping includes the use of a variable which gives the index at which galloping should begin. Thus the algorithm can enter galloping mode at any index and continue thereon as mentioned above, as in, it will check at the next index which is offset by 1, 3, 7, ..., (2^k - 1), and so on from the current index. In the case of merge-hi, the offsets to the index will be -1, -3, -7, ...."
I've got these requirements:
1. I have random values in a List/Array and I need to find the 3 max values.
2. I have a pool of values, and this pool gets updated maybe every 5 seconds. Now, every time after the update, I need to find the 3 max values from the pool.
I thought of using Math.max thrice on the list, but I don't think that is a very optimized approach.
Won't any sorting mechanism be costly, since I am only bothered about the top 3 max values? Why sort everything?
Please suggest the best way to do it in Java.
Sort the list, get the 3 max values. If you don't want the expense of the sort, iterate and maintain the n largest values.
Maintain the pool as a sorted collection.
Update: FYI Guava has an Ordering class with a greatestOf method to get the n max elements in a collection. You might want to check out the implementation.
Ordering.greatestOf
Traverse the list once, keeping an ordered array of three largest elements seen so far. This is trivial to update whenever you see a new element, and instantly gives you the answer you're looking for.
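A minimal sketch of that single pass (the int[] type and the names are illustrative): keep the three largest values seen so far in descending order and shift them down when a larger value arrives.

static int[] topThree(int[] values) {
    int[] top = {Integer.MIN_VALUE, Integer.MIN_VALUE, Integer.MIN_VALUE};
    for (int v : values) {
        if (v > top[0])      { top[2] = top[1]; top[1] = top[0]; top[0] = v; }
        else if (v > top[1]) { top[2] = top[1]; top[1] = v; }
        else if (v > top[2]) { top[2] = v; }
    }
    return top;               // top[0] >= top[1] >= top[2]
}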
A priority queue should be the data structure you need in this case.
First, it would be wise to never say again, "I dont think it as a very optimized approach." You will not know which part of your code is slowing you down until you put a profiler on it.
Second, the easiest way to do what you're trying to do -- and what will be most clear to someone later if they are trying to see what your code does -- is to use Collections.sort() and pick off the last three elements. Then anyone who sees the code will know, "oh, this code takes the three largest elements." There is so much value in clear code that it will likely outweigh any optimization that you might have done. It will also keep you from writing bugs, like giving a natural meaning to what happens when someone puts the same number into the list twice, or giving a useful error message when there are only two elements in the list.
Third, if you really get data which is so large that O(n log n) operations are too slow, you should rewrite the data structure which holds the data in the first place: java.util.NavigableSet, for example, offers a descendingIterator() method which you can probe for its first three elements; those would be the three maximum numbers. If you really want, a heap data structure can be used, and you can pull off the top 3 elements with something like one comparison each, at the cost of making each add an O(log n) operation.
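If you go the heap route, a sketch of the idea for the pool that changes every few seconds: keep a min-heap of size 3, so each incoming value costs at most a couple of comparisons and the three maxima are always at hand (method and variable names here are illustrative).

import java.util.PriorityQueue;

static void offerToTop3(PriorityQueue<Integer> top3, int value) {
    if (top3.size() < 3) {
        top3.offer(value);
    } else if (value > top3.peek()) {   // the new value beats the smallest of the current top 3
        top3.poll();
        top3.offer(value);
    }
}

After feeding the whole pool through offerToTop3, the queue holds the three largest values, with the smallest of the three at its head.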