Optimizing heap structure for heapsort - java

I'm implementing heapsort using a heap. Each value to be sorted is inserted into the heap, and the insertion method calls heapifyUp() (aka siftUp), so heapifyUp runs once per inserted value. Is this the most efficient way?
Another idea would be to insert all the elements first and then heapify. I guess heapifyUp would have to be called on each one? Is doing it that way better?

Inserting each element will build the heap in O(n log n) time. Same thing if you add all the elements to an array and then repeatedly call heapifyUp().
Floyd's Algorithm builds the heap bottom-up in O(n) time. The idea is that you take an array that's in any order and, starting in the middle, sift each item down to its proper place. The algorithm is:
for i = array.length/2 downto 0
{
    siftDown(i)
}
You start in the middle because the last length/2 items in the array are leaves; they can't be sifted down. By working your way from the middle up, you reduce the number of items that have to be moved.
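In Java, a minimal sketch of Floyd's build, assuming a min-heap stored in a plain int array (the names are illustrative, not from the asker's code):

public class BuildHeap {
    // Floyd's bottom-up construction of a min-heap.
    static void buildHeap(int[] a) {
        for (int i = a.length / 2; i >= 0; i--) {
            siftDown(a, i);
        }
    }

    static void siftDown(int[] a, int i) {
        int n = a.length;
        while (2 * i + 1 < n) {                       // while i has at least one child
            int child = 2 * i + 1;
            if (child + 1 < n && a[child + 1] < a[child]) {
                child++;                              // pick the smaller child
            }
            if (a[i] <= a[child]) break;              // heap property already holds
            int tmp = a[i]; a[i] = a[child]; a[child] = tmp;
            i = child;                                // keep sifting the item down
        }
    }

    public static void main(String[] args) {
        int[] a = {7, 5, 6, 1, 2, 3, 4};
        buildHeap(a);
        System.out.println(java.util.Arrays.toString(a)); // [1, 2, 3, 5, 7, 6, 4]
    }
}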
Example of the difference
The example below, turning an array of 7 items into a heap, shows the difference in the amount of work done.
The heapifyUp() method
[7,5,6,1,2,3,4] (starting state)
Start at the end and bubble items up.
Move 4 to the proper place
[7,5,4,1,2,3,6]
[4,5,7,1,2,3,6]
Move 3 to its place
[4,5,3,1,2,7,6]
[3,5,4,1,2,7,6]
Move 2 to its place
[3,2,4,1,5,7,6]
[2,3,4,1,5,7,6]
Move 1 to its place
[2,1,4,3,5,7,6]
[1,2,4,3,5,7,6]
The heap is now in order. It took 8 swaps, and you still have to check 4, 2, and 1 (they happen to need no further moves, but the checks still cost comparisons).
Floyd's algorithm
[7,5,6,1,2,3,4] (starting state)
Start at the halfway point and sift down. In a 0-based array of 7 items, the halfway point is 3.
Move 1 to its place
[7,5,6,1,2,3,4] (no change. Remember, we're sifting down, and index 3 is a leaf)
Move 6 to its place
[7,5,3,1,2,6,4]
Move 5 to its place
[7,1,3,5,2,6,4] (one swap; 5 lands at index 3, which is a leaf)
Move 7 to its place
[1,7,3,5,2,6,4]
[1,2,3,5,7,6,4]
And we're done. It took 4 swaps and there's nothing else to check.

Related

How do I find the optimal path through a grid?

Overview of the problem: You are a truffle collector, and are given a grid of numbers representing plots of land with truffles on them. Each plot has a certain number of truffles on it. You must find the optimal path from the top of the grid to the bottom (the one that collects the most truffles). Importantly, you can start from any cell in the top row. When you are at a cell, you can move diagonally down to the left, directly down, or diagonally down to the right. A truffle field might look like this:
The truffle fields also do not have to be square. They can have any dimensions.
So, I have created an iterative algorithm for this problem. Essentially, what I have done is iterate through each cell in the top row, finding the greedy path emanating from each and choosing the one with the largest truffle yield. To elaborate, the greedy path is one in which at every step, the largest value that can be reached in the next row from the current cell is chosen.
This algorithm yields the correct result for some truffle fields, like the one above, but it fails on fields like this:
This is because when the algorithm hits the 100 in the third column, it will go directly down to the 3 because it is the largest immediate value it can move to, but it does not consider that moving to the 2 to the left of it will enable it to reach another 100. The optimal path through this field obviously involves both cells with a value of 100, but the greedy algorithm I have now will never yield this path.
So, I have a hunch that the correct algorithm for this problem involves recursion, likely recursive backtracking in particular, but I am not sure how to approach creating a recursive algorithm to solve it. I have always struggled with recursion and find it difficult to come up with algorithms using it. I would really appreciate any ideas you all could provide.
Here is the code. My algorithm is being executed in the findPath method: https://github.com/jhould007/Programming-Assignment-3/blob/master/Truffle.java.
You could use recursion, but there's a simple iterative fix to your approach as well.
Instead of the top row, start with the bottom one. Create a 1D array values and initialise it with the values of the bottom row.
Iterate curr_row from max_row-1 down to 0. For each iteration, create a temporary array temp of the same width, initialised with 0's.
During the iteration for curr_row, values[i] holds the maximum yield you can collect if you start from row curr_row+1 (basically the next row) at column i.
To fill temp in each iteration, we just need to pick the best continuation in the next row, which can be read straight out of the values array (skipping the out-of-range neighbours at the two edges):
for column in range [0, max_column]:
    temp[column] = truffle_value[curr_row][column] + max(values[column-1], values[column], values[column+1])
// temp holds the values for the next iteration of our loop
values = temp
In the end, the answer will simply be max(values).
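A runnable Java sketch of this bottom-up approach (the sample field and the method name are illustrative; in the linked code, this logic would replace the body of findPath):

public class Truffles {
    // values[c] holds the best yield obtainable starting at column c of the
    // row below the one currently being processed.
    static int bestYield(int[][] field) {
        int rows = field.length, cols = field[0].length;
        int[] values = field[rows - 1].clone();          // start with the bottom row
        for (int r = rows - 2; r >= 0; r--) {
            int[] temp = new int[cols];
            for (int c = 0; c < cols; c++) {
                int best = values[c];                                   // straight down
                if (c > 0) best = Math.max(best, values[c - 1]);        // down-left
                if (c < cols - 1) best = Math.max(best, values[c + 1]); // down-right
                temp[c] = field[r][c] + best;
            }
            values = temp;          // temp becomes "the next row" for the row above
        }
        int answer = values[0];
        for (int v : values) answer = Math.max(answer, v);  // best start in the top row
        return answer;
    }

    public static void main(String[] args) {
        int[][] field = {
            {1, 2, 100},
            {3, 2, 3},
            {100, 1, 1},
        };
        System.out.println(bestYield(field)); // 202: the 100 -> 2 -> 100 path greedy misses
    }
}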

All combinations of 2d array chess game

I am making a program that is calculating the number of possible solutions of a chess game with only bishops and queens. The user can put in the number of queens and bishops, as well as the size of the chess board (rows & columns).
I will call any set of positions for the bishops and the queens on the board a combination. A combination counts as a solution if all squares are attacked (Chess domination problem).
So for example, if the user gives 1 Queen and 3 Bishops, on a 5x5 chess board, a possible solution can be:
- - B - -
- - - - -
- B Q B -
- - - - -
- - - - -
Now I have trouble making a program that finds all the possible positions of a given set of pieces without duplicates. Duplicates can occur because the user can give multiple bishops, for example. The solution needs to be recursive.
You don't show your current solution, but I assume you pick each square for the first piece, then pick each square for the second piece, continuing only if that square is still unoccupied, then repeat for the third, etc.
If the first and second piece are the same type, this causes duplication: first piece on square A with second on square B is the same combination as first on B with second on A.
If you have two pieces of the same type, you can impose an ordering when positioning the second piece: never place the second identical piece at a lower index than the first. This avoids the duplication while still visiting every distinct combination of positions, and you can extend the same ordering to any number of pieces of the same type.
When you have a different type of piece, the two orderings become distinct: if the first and second pieces are of different types, then first piece on square A, second on square B and first on B, second on A are genuinely different cases. However, when you put down the second instance of the new type, you can apply the ordering rule against the first instance of that type.
[Alternatively, you can insist that the second piece is placed before the first piece - the outcome is the same]
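A Java sketch of the ordering rule, counting distinct placements (the piece encoding and board size are illustrative, and the all-squares-attacked check is left out, since it is independent of the duplication problem):

import java.util.ArrayList;
import java.util.List;

public class Placements {
    // pieces[i] is a piece type id (e.g. 0 = queen, 1 = bishop); identical
    // types must be adjacent in the array for the ordering rule to apply.
    static void place(int[] pieces, int next, int[] squares, boolean[] occupied,
                      List<int[]> out) {
        if (next == pieces.length) {
            out.add(squares.clone());                 // one complete combination
            return;
        }
        // Ordering rule: if this piece has the same type as the previous one,
        // only try squares with a higher index than the previous piece's square.
        int start = (next > 0 && pieces[next] == pieces[next - 1])
                ? squares[next - 1] + 1 : 0;
        for (int sq = start; sq < occupied.length; sq++) {
            if (occupied[sq]) continue;               // square already taken
            occupied[sq] = true;
            squares[next] = sq;
            place(pieces, next + 1, squares, occupied, out);
            occupied[sq] = false;                     // backtrack
        }
    }

    public static void main(String[] args) {
        // 1 queen (type 0) and 2 bishops (type 1) on a 3x3 board = 9 squares.
        List<int[]> out = new ArrayList<>();
        place(new int[]{0, 1, 1}, 0, new int[3], new boolean[9], out);
        System.out.println(out.size());               // 9 * C(8,2) = 252, no duplicates
    }
}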
As a second optimisation, you can observe that if you have 3 bishops and the third must be placed after the other 2, then the first cannot be placed in the last or penultimate squares, so you can optimise your placement of the first one very slightly.
This becomes more complex when this is the second type of piece, and it is possibly not worth doing.
A third optimisation is to keep a list of the available squares. Once a piece is put down, its square is removed from the list, so the list is shorter for placing the next piece, and you don't have to "fail" when you try and put the queen on top of a bishop, as you won't try. You can use the length of this list to simplify the second optimisation.
You can do some clever tricks with std::list::splice to mean that you don't reallocate or duplicate this list as you recurse through the pieces and positions.

Better algorithmic approach to showing trends of data per week

Suppose I have a list of projects with start date and end date. I also have a range of weeks, which varies (could be over months, years, etc)
I would like to display a graph showing 4 values per week:
projects started
projects closed
total projects started
total projects closed
I could loop over the range of weekly values and, for each week, iterate through my list of projects and compute each of these 4 values. That has complexity O(nm), where n is the number of weeks and m is the number of projects. That's not so great.
Is there a more efficient approach, and if so, what would it be?
If it's pertinent, I'm coding in Java
What user yurib has said is true, but there is a more efficient solution. Keep two arrays in memory, projects_started and projects_ended, both of size 52. Loop through your list of projects and for each project increment the corresponding value in both arrays. Something like:
projects_started[projects[i].start_week]++;
projects_ended[projects[i].end_week]++;
After the loop you have all the data you need to make a graph. Complexity is O(m).
EDIT: okay, so the maximum number of weeks can vary, but as long as it stays below some ludicrous number (say, a million), this algorithm still works. Just replace 52 with n. Time complexity is O(m), space complexity is O(n).
EDIT: in order to determine the value of total projects started and ended you have to iterate through the two arrays that you now have and just add up the values. You could do this while populating the graph:
for (int i = 0; i < n; i++)
{
    total_started_so_far += projects_started[i];  // running totals across all weeks so far
    total_ended_so_far += projects_ended[i];
    // add a data point for this week to the graph
}
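Putting the two passes together, a runnable sketch (the sample data and week numbering are made up for illustration):

public class WeeklyTrends {
    public static void main(String[] args) {
        // each project is {startWeek, endWeek}, weeks numbered 0 .. n-1
        int[][] projects = {{0, 2}, {1, 3}, {1, 1}, {2, 4}};
        int n = 5;                                   // weeks in the reporting range

        int[] started = new int[n];
        int[] ended = new int[n];
        for (int[] p : projects) {                   // O(m): one pass over the projects
            started[p[0]]++;
            ended[p[1]]++;
        }

        int totalStarted = 0, totalEnded = 0;
        for (int week = 0; week < n; week++) {       // O(n): one pass over the weeks
            totalStarted += started[week];           // running (cumulative) totals
            totalEnded += ended[week];
            System.out.printf("week %d: started=%d ended=%d totals=%d/%d%n",
                    week, started[week], ended[week], totalStarted, totalEnded);
        }
    }
}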
I'm not sure what the difference between "project" and "total" is, but here's a simple O(n log n) way to calculate the number of projects started and closed in each week:
For each project, add its start and end points to a list.
Sort the list in increasing order.
Walk through the list, pulling out time points until you hit a time point that occurs in a later week. At this point, "projects started" is the total number of start points you have hit, and "projects ended" is the total number of end points you have hit: report these counters, and reset them both to zero. Then continue on to process the next week.
Incidentally, if there are weeks in which no projects start or end, this procedure will skip them. If you want to report these weeks as "0, 0" totals, then whenever you output a week with a nonzero total, first output as many "0, 0" weeks as needed to fill the gap since the last nonzero-total week. (This is easy to do by keeping a lastNonzeroWeek variable that you update each time you output a nonzero-total week.)
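A Java sketch of this event-sorting approach, including the "0, 0" gap filling (the {startWeek, endWeek} encoding of a project is illustrative):

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class EventCounts {
    public static void main(String[] args) {
        int[][] projects = {{0, 2}, {1, 4}, {1, 1}};  // {startWeek, endWeek}

        List<int[]> events = new ArrayList<>();       // {week, isStart ? 1 : 0}
        for (int[] p : projects) {
            events.add(new int[]{p[0], 1});
            events.add(new int[]{p[1], 0});
        }
        events.sort(Comparator.comparingInt((int[] e) -> e[0]));  // O(n log n)

        int i = 0, lastWeek = -1;
        while (i < events.size()) {
            int week = events.get(i)[0];
            for (int w = lastWeek + 1; w < week; w++) // fill gap weeks with zeros
                System.out.println("week " + w + ": 0 started, 0 ended");
            int started = 0, ended = 0;
            while (i < events.size() && events.get(i)[0] == week) {
                if (events.get(i)[1] == 1) started++; else ended++;
                i++;
            }
            System.out.println("week " + week + ": " + started + " started, " + ended + " ended");
            lastWeek = week;
        }
    }
}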
First of all, I guess that actually performance won't be an issue; this looks like a case of "premature optimization". You should first do it, then do it right, then do it fast.
I suggest you use maps, which will make your code more readable and outsource implementation details (like performance).
Create a HashMap from int (representing the week number) to Set<Project>, then iterate over your projects and for each one, put it into the map at the right place. After that, iterate over the map's key set (= all non-empty weeks) and do your processing for each one.
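A sketch of that bucketing step (the Project record is an illustrative stand-in for the real class; records need Java 16+):

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class WeekBuckets {
    record Project(String name, int startWeek, int endWeek) {}

    public static void main(String[] args) {
        Project[] projects = {
            new Project("a", 0, 2), new Project("b", 1, 3), new Project("c", 1, 1),
        };

        Map<Integer, Set<Project>> startedIn = new HashMap<>();
        for (Project p : projects) {
            // create the week's bucket on first use, then add the project to it
            startedIn.computeIfAbsent(p.startWeek(), w -> new HashSet<>()).add(p);
        }

        // iterate over all non-empty weeks and process each bucket
        for (Map.Entry<Integer, Set<Project>> e : startedIn.entrySet()) {
            System.out.println("week " + e.getKey() + ": " + e.getValue().size() + " started");
        }
    }
}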

Stream of numbers and best space complexity to find n/2th element

I was trying to solve this problem: a stream of numbers of length not more than M will be given. You don't know the exact length of the stream, but you are sure it won't exceed M. At the end of the stream, you have to report the N/2th element, where N is the number of elements that came in the stream. What would be the best space complexity with which you can solve this problem?
my solution:
I think we can take a queue of size M/2, push two elements, then pop one element, and keep going until the stream is over. The N/2th element will then be at the head of the queue. Time complexity will be at least O(n) for any approach, but for this approach the space complexity is M/2. Is there any better solution?
I hope it is obvious that you will need to buffer at least N/2 elements (unless you can re-iterate through your stream, reading the same data again). Your algorithm uses M/2; given that N is upper-bounded by M, it might look like it doesn't matter which you choose, since N can go up to M.
But it doesn't have to. If N is much smaller than M (for example N=5 and M=1,000,000), then allocating M/2 up front would waste a lot of resources.
I would recommend a dynamically growing array structure, something like ArrayList, though it is not good at removing the first element.
Conclusion: You can have O(N) both time and memory complexity, and you can't get any better.
Friendly edit regarding ArrayList: adding an element to an ArrayList takes amortized constant time, so adding N items is O(N) in time. Removing the first element, however, is linear per removal (per the JavaDoc), so you can definitely get O(N) in time and space, but ONLY IF you don't remove anything. If you do remove, you still get O(N) in space (O(N/2) = O(N)), but your time complexity goes up.
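For what it's worth, an ArrayDeque sidesteps the linear-time removal: it grows dynamically and removes from the head in constant time. A sketch of the original push-two-pop-one idea with it (the sample array stands in for the incoming stream):

import java.util.ArrayDeque;

public class StreamMiddle {
    public static void main(String[] args) {
        int[] stream = {10, 20, 30, 40, 50};      // stand-in for the incoming stream
        ArrayDeque<Integer> queue = new ArrayDeque<>();
        int count = 0;
        for (int value : stream) {
            queue.addLast(value);
            count++;
            if (count % 2 == 0) {
                queue.removeFirst();              // keep only the most recent ~half
            }
        }
        // head of the queue: the ceil(N/2)-th element for odd N
        // (for even N this is the (N/2 + 1)-th; adjust the popping rule if needed)
        System.out.println(queue.peekFirst());    // prints 30
    }
}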
Do you know the "tortoise and hare" algorithm? Start with two pointers to the beginning of the input. Then at each step advance the hare two elements and the tortoise one element. When the hare reaches the end of the input the tortoise is at the midpoint. This is O(n) time, since it visits each element of the input once, and O(1) space, since it keeps exactly two pointers regardless of the size of the input.
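A sketch of the tortoise-and-hare midpoint on a singly linked list. Note the caveat: it assumes the input is already held as something you can keep pointers into (a linked list here), not a strict one-pass stream:

public class Midpoint {
    static class Node {
        int value;
        Node next;
        Node(int value, Node next) { this.value = value; this.next = next; }
    }

    static int middle(Node head) {
        Node tortoise = head, hare = head;
        while (hare != null && hare.next != null) {
            hare = hare.next.next;     // hare advances two elements
            tortoise = tortoise.next;  // tortoise advances one
        }
        return tortoise.value;         // tortoise sits at the midpoint
    }

    public static void main(String[] args) {
        Node list = null;
        for (int i = 5; i >= 1; i--) list = new Node(i * 10, list);  // 10,20,...,50
        System.out.println(middle(list));  // prints 30
    }
}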

Java.util.ArrayList.sort() sorting algorithm

I was looking at the source code of the sort() method of the java.util.ArrayList on grepcode. They seem to use insertion sort on small arrays (of size < 7) and merge sort on large arrays. I was just wondering if that makes a lot of difference given that they use insertion sort only for arrays of size < 7. The difference in running time will be hardly noticeable on modern machines.
I have read this in Cormen:
Although merge sort runs in O(n log n) worst-case time and insertion sort runs in O(n^2) worst-case time, the constant factors in insertion sort can make it faster in practice for small problem sizes on many machines. Thus, it makes sense to coarsen the leaves of the recursion by using insertion sort within merge sort when subproblems become sufficiently small.
If I were designing a sorting algorithm for some component, I would consider using insertion sort for larger sizes (maybe up to size < 100) before the difference in running time compared to merge sort becomes evident.
My question is what is the analysis behind arriving at size < 7?
"The difference in running time will be hardly noticeable on modern machines."
How long it takes to sort small arrays becomes very important when you realize that the overall sorting algorithm is recursive, and the small array sort is effectively the base case of that recursion.
I don't have any inside info on how the number seven got chosen. However, I'd be very surprised if that wasn't done as the result of benchmarking the competing algorithms on small arrays, and choosing the optimal algorithm and threshold based on that.
P.S. It is worth pointing out that Java 7 uses Timsort by default.
I am posting this for people who visit this thread in the future, and to document my own research. I stumbled across this excellent link in my quest to find the answer to the mystery of choosing 7:
Tim Peters’s description of the algorithm
You should read the section titled "Computing minrun".
To give the gist, minrun is the cutoff size below which the algorithm switches to insertion sort. Hence, we will always have sorted runs of length at least "minrun", on which the merge operation then runs to sort the entire array.
In java.util.ArrayList.sort(), the cutoff is chosen to be 7, but as far as my understanding of the above document goes, it busts that myth and shows that minrun should be near a power of 2, more than 8 and less than 256. Quoting from the document:
At 256 the data-movement cost in binary insertion sort clearly hurt, and at 8 the increase in the number of function calls clearly hurt. Picking some power of 2 is important here, so that the merges end up perfectly balanced (see next section).
The point I am making is that "minrun" can be any power of 2 (or near a power of 2) less than 64 without hindering the performance of Timsort.
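For reference, the computation itself is tiny. The sketch below is adapted from OpenJDK's java.util.TimSort (where MIN_MERGE is 32): for n >= MIN_MERGE it keeps the top bits of n and adds 1 if any of the discarded bits were set, yielding a value in [MIN_MERGE/2, MIN_MERGE] such that n divided by it is close to, but strictly less than, a power of 2.

public class MinRun {
    static final int MIN_MERGE = 32;   // value used in OpenJDK's TimSort

    static int minRunLength(int n) {
        int r = 0;                     // becomes 1 if any 1 bits are shifted off
        while (n >= MIN_MERGE) {
            r |= (n & 1);
            n >>= 1;
        }
        return n + r;                  // n itself when the array is already small
    }

    public static void main(String[] args) {
        System.out.println(minRunLength(100));   // 25: 100/25 = 4, a power of 2
        System.out.println(minRunLength(1000));  // 32
    }
}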
http://en.wikipedia.org/wiki/Timsort
"Timsort is a hybrid sorting algorithm, derived from merge sort and insertion sort, designed to perform well on many kinds of real-world data... The algorithm finds subsets of the data that are already ordered, and uses the subsets to sort the data more efficiently. This is done by merging an identified subset, called a run, with existing runs until certain criteria are fulfilled."
About number 7:
"... Also, it is seen that galloping is beneficial only when the initial element is not one of the first seven elements of the other run. This also results in MIN_GALLOP being set to 7. To avoid the drawbacks of galloping mode, the merging functions adjust the value of min-gallop. If the element is from the array currently under consideration (that is, the array which has been returning the elements consecutively for a while), the value of min-gallop is reduced by one. Otherwise, the value is incremented by one, thus discouraging entry back to galloping mode. When this is done, in the case of random data, the value of min-gallop becomes so large, that the entry back to galloping mode never takes place.
In the case where merge-hi is used (that is, merging is done right-to-left), galloping needs to start from the right end of the data, that is, the last element. Galloping from the beginning also gives the required results, but makes more comparisons than required. Thus, the algorithm for galloping includes the use of a variable which gives the index at which galloping should begin. Thus the algorithm can enter galloping mode at any index and continue thereon as mentioned above; that is, it will check at the next index which is offset by 1, 3, 7, ..., (2^k − 1) and so on from the current index. In the case of merge-hi, the offsets to the index will be -1, -3, -7, ..."
