Suppose I have a list of projects with start date and end date. I also have a range of weeks, which varies (could be over months, years, etc)
I would like to display a graph showing 4 values per week:
projects started
projects closed
total projects started
total projects closed
I could loop over the range of weekly values and, for each week, iterate through my list of projects and calculate values for each of these 4 trends. This would have algorithmic complexity O(nm), where n is the number of weeks and m is the number of projects. That's not so great.
Is there a more efficient approach, and if so, what would it be?
If it's pertinent, I'm coding in Java
While what user yurib has said is true, there is a more efficient solution. Keep two arrays in memory, projects_started and projects_ended, both of size 52. Loop through your list of projects and for each project increment the corresponding value in both arrays. Something like:
projects_started[projects[i].start_week]++;
projects_ended[projects[i].end_week]++;
After the loop you have all the data you need to make a graph. Complexity is O(m).
EDIT: okay, so the maximum number of weeks can apparently vary, but as long as it's smaller than some ludicrous number (say a million) this algorithm still works. Just replace 52 with n. Time complexity is O(m), space complexity is O(n).
EDIT: in order to determine the totals of projects started and ended, you have to iterate through the two arrays you now have and add up the values. You could do this while populating the graph:
for (int i = 0; i < n; i++)
{
    total_started_so_far += projects_started[i];
    total_ended_so_far += projects_ended[i];
    // add new item to the graph
}
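Putting the two loops together, here is a runnable sketch of this counting approach; the Project class, its field names, and the sample data are my assumptions for illustration:

```java
import java.util.List;

public class WeeklyCounts {
    // Hypothetical project record; the field names are an assumption, not the asker's code.
    static class Project {
        final int startWeek, endWeek; // week indices in [0, n)
        Project(int startWeek, int endWeek) {
            this.startWeek = startWeek;
            this.endWeek = endWeek;
        }
    }

    // One pass over the projects fills both histograms: O(m) time, O(n) space.
    static int[][] counts(List<Project> projects, int n) {
        int[] started = new int[n], ended = new int[n];
        for (Project p : projects) {
            started[p.startWeek]++;
            ended[p.endWeek]++;
        }
        return new int[][] { started, ended };
    }

    public static void main(String[] args) {
        List<Project> projects = List.of(new Project(0, 2), new Project(1, 3)); // sample data
        int n = 4; // number of weeks in the range
        int[][] c = counts(projects, n);
        int totalStarted = 0, totalEnded = 0;
        for (int week = 0; week < n; week++) { // running totals are a prefix sum over the histograms
            totalStarted += c[0][week];
            totalEnded += c[1][week];
            System.out.printf("week %d: started=%d ended=%d totalStarted=%d totalEnded=%d%n",
                    week, c[0][week], c[1][week], totalStarted, totalEnded);
        }
    }
}
```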
I'm not sure what the difference between "project" and "total" is, but here's a simple O(n log n) way to calculate the number of projects started and closed in each week:
For each project, add its start and end points to a list.
Sort the list in increasing order.
Walk through the list, pulling out time points until you hit a time point that occurs in a later week. At this point, "projects started" is the total number of start points you have hit, and "projects ended" is the total number of end points you have hit: report these counters, and reset them both to zero. Then continue on to process the next week.
Incidentally, if there are some weeks without any projects that start or end, this procedure will skip them out. If you want to report these weeks as "0, 0" totals, then whenever you output a week that has some nonzero total, make sure you first output as many "0, 0" weeks as it takes to fill in the gap since the last nonzero-total week. (This is easy to do just by setting a lastNonzeroWeek variable each time you output a nonzero-total week.)
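A sketch of this sort-and-sweep idea; the project data, the array-based event encoding, and the perWeek helper are made up for illustration:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class EventSweep {
    // Sort the 2n endpoints, then sweep week by week: O(n log n) overall.
    // Returns {startedPerWeek, endedPerWeek}; gap weeks come out as 0/0.
    static int[][] perWeek(int[][] projects, int nWeeks) {
        List<int[]> events = new ArrayList<>(); // {week, type}: 0 = start, 1 = end
        for (int[] p : projects) {
            events.add(new int[] { p[0], 0 });
            events.add(new int[] { p[1], 1 });
        }
        events.sort((a, b) -> Integer.compare(a[0], b[0])); // O(n log n)

        int[] started = new int[nWeeks], ended = new int[nWeeks];
        int week = 0, s = 0, e = 0;
        for (int[] ev : events) {
            while (week < ev[0]) {        // hit a later week: report and reset the counters
                started[week] = s;
                ended[week] = e;
                s = 0;
                e = 0;
                week++;                   // weeks skipped here are left as 0/0
            }
            if (ev[1] == 0) s++; else e++;
        }
        started[week] = s;                // flush the final week's counters
        ended[week] = e;
        return new int[][] { started, ended };
    }

    public static void main(String[] args) {
        int[][] projects = { { 0, 2 }, { 1, 3 }, { 1, 2 } }; // {startWeek, endWeek}, made-up
        int[][] r = perWeek(projects, 4);
        System.out.println(Arrays.toString(r[0])); // [1, 2, 0, 0] started per week
        System.out.println(Arrays.toString(r[1])); // [0, 0, 2, 1] ended per week
    }
}
```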
First of all, I guess performance won't actually be an issue; this looks like a case of "premature optimization". You should first do it, then do it right, then do it fast.
I suggest you use maps, which will make your code more readable and outsource implementation details (like performance).
Create a HashMap from int (representing the week number) to Set<Project>, then iterate over your projects and for each one, put it into the map at the right place. After that, iterate over the map's key set (= all non-empty weeks) and do your processing for each one.
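A minimal sketch of that bucketing; I'm counting projects per week here, but a Map<Integer, Set<Project>> works the same way if you need the project objects themselves. The sample weeks are made up:

```java
import java.util.HashMap;
import java.util.Map;

public class WeekBuckets {
    // Bucket values by week number in one pass over the input.
    static Map<Integer, Integer> bucketCounts(int[] weeks) {
        Map<Integer, Integer> perWeek = new HashMap<>();
        for (int w : weeks) {
            perWeek.merge(w, 1, Integer::sum); // create-or-increment the bucket for week w
        }
        return perWeek;
    }

    public static void main(String[] args) {
        int[] startWeeks = { 0, 1, 1 }; // made-up start weeks for three projects
        Map<Integer, Integer> starts = bucketCounts(startWeeks);
        System.out.println(starts);     // only non-empty weeks appear as keys
    }
}
```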
I'm implementing heapsort using a heap. Each value to be sorted is inserted into the heap, and the insertion method calls heapifyUp() (aka siftUp), so heapifyUp is called once per inserted value. Is this the most efficient way?
Another idea would be to insert all the elements first and then call heapifyUp. I guess heapifyUp would have to be called on each one? Is doing it that way better?
Inserting each element will build the heap in O(n log n) time. Same thing if you add all the elements to an array and then repeatedly call heapifyUp().
Floyd's Algorithm builds the heap bottom-up in O(n) time. The idea is that you take an array that's in any order and, starting in the middle, sift each item down to its proper place. The algorithm is:
for i = array.length/2 downto 0
{
siftDown(i)
}
You start in the middle because the last length/2 items in the array are leaves. They can't be sifted down. By working your way from the middle, up, you reduce the number of items that have to be moved.
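The pseudocode above can be sketched concretely for a min-heap on a plain int array; the method names buildHeap/siftDown are mine, not from the question:

```java
import java.util.Arrays;

public class BuildHeap {
    // Floyd's bottom-up heap construction: sift down from the middle, O(n) overall.
    static void buildHeap(int[] a) {
        for (int i = a.length / 2; i >= 0; i--) {
            siftDown(a, i, a.length);
        }
    }

    // Repeatedly swap a[i] with its smaller child until the min-heap property holds.
    static void siftDown(int[] a, int i, int n) {
        while (2 * i + 1 < n) {
            int child = 2 * i + 1;
            if (child + 1 < n && a[child + 1] < a[child]) child++; // pick the smaller child
            if (a[i] <= a[child]) break;                           // heap property holds here
            int tmp = a[i]; a[i] = a[child]; a[child] = tmp;
            i = child;
        }
    }

    public static void main(String[] args) {
        int[] a = {7, 5, 6, 1, 2, 3, 4};   // the starting state used in the example below
        buildHeap(a);
        System.out.println(Arrays.toString(a)); // [1, 2, 3, 5, 7, 6, 4]
    }
}
```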
Example of the difference
The example below, turning an array of 7 items into a heap, shows the difference in the amount of work done.
The heapifyUp() method
[7,5,6,1,2,3,4] (starting state)
Start at the end and bubble items up.
Move 4 to the proper place
[7,5,4,1,2,3,6]
[4,5,7,1,2,3,6]
Move 3 to its place
[4,5,3,1,2,7,6]
[3,5,4,1,2,7,6]
Move 2 to its place
[3,2,4,1,5,7,6]
[2,3,4,1,5,7,6]
Move 1 to its place
[2,1,4,3,5,7,6]
[1,2,4,3,5,7,6]
The heap is now in order. It took 8 swaps, and you still have to check 4, 2, and 1.
Floyd's algorithm
[7,5,6,1,2,3,4] (starting state)
Start at the halfway point and sift down. In a 0-based array of 7 items, the halfway point is 3.
Move 1 to its place
[7,5,6,1,2,3,4] (no change: 1 is at index 3, which has no children, so there's nothing to sift it down past)
Move 6 to its place
[7,5,3,1,2,6,4]
Move 5 to its place
[7,1,3,5,2,6,4] (just one swap; 5 lands on a leaf)
Move 7 to its place
[1,7,3,5,2,6,4]
[1,2,3,5,7,6,4]
And we're done. It took 4 swaps and there's nothing else to check.
I was recently asked an interview question: find the top N (10, 20) integers in a List over a period of time. Elements are added to the List dynamically at regular intervals, like every 5 seconds. Could you please tell me the correct data structure and algorithm for this problem?
Such questions are normally not very sophisticated.
10 highest of the last 20 entries: keep an ArrayList of at most 20 elements, where adding at the end may remove one at the beginning. Then add them to a new SortedSet (like TreeSet) and take the first 10 in reversed order. See #iced
A Queue would almost fit, but not entirely. The most important point is correctness: note that you cannot simply sort that ArrayList (it has to keep arrival order), and that fewer than 10 top numbers may come out of the set when there are many duplicates, since a Set collapses them. Points for adding concurrency guards and such.
You can use a Java PriorityQueue / max-heap.
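For instance, a size-bounded min-heap keeps the N largest values seen so far: the head is the smallest of the kept values and gets evicted when something larger arrives. All names, the sample stream, and N=3 are illustrative:

```java
import java.util.PriorityQueue;

public class TopN {
    // PriorityQueue is a min-heap by default, so the head is the smallest
    // of the current top-N candidates.
    static PriorityQueue<Integer> heap = new PriorityQueue<>();
    static final int N = 3;

    static void offer(int value) {
        heap.offer(value);                 // O(log N)
        if (heap.size() > N) heap.poll();  // evict the current smallest
    }

    public static void main(String[] args) {
        int[] stream = {5, 1, 9, 3, 7, 2, 8};
        for (int v : stream) offer(v);
        System.out.println(heap);          // contains 7, 8 and 9, in heap order
    }
}
```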
I need to design an efficient algorithm for a scheduling problem and I really don't have a clue.
There's a machine that produces pills at a certain rate. For example, the machine might be able to produce 1 pill if it is allowed to work continuously for one day, 4 pills if it works for 3 days, 16 pills if it works for 5 days, and so on. If I stop the machine and take out all the pills, the machine starts again from day 1. All pills that I take out of the machine must be used on the same day.
A certain number of patients come and take pills every day. The patients must be treated on the same day; untreated patients are ignored. The goal is to decide which days to stop the machine on, so as to treat as many patients as possible in n days.
Suppose the number of days n = 5, given example input
int[] machineRate = {1,2,4,8,16};
int[] patients = {1,2,3,0,2};
In this case, if I stop the machine on day 3, I will have 4 pills. I can treat 3 patients and throw away 1 pill. Then I stop the machine again on day 5; since it was stopped on day 3, it has been working for 2 days, so I have 2 pills to treat 2 patients. In the end 3+2=5, so the output is 5 patients.
If instead I stop the machine on days 2, 3 and 5, the output will be (2 pills for 2 patients on day 2) + (1 pill for 3 patients on day 3) + (2 pills for 2 patients on day 5), which equals 5 patients as well.
The machineRate[] and patients[] vary according to input.
What's the algorithm that finds the maximum number of treated patients?
This is a nice dynamic programming problem.
The way to think of dynamic programming is to ask yourself two questions:
Is there a trivial version of this problem if I reduce one (or more) of the variables to zero (or similar)?
Is there a simple way of calculating the answer to a problem of size n+1 if I know answers to all problems of size n? (Here, "size" is problem-specific, and you need to find the right notion of size that helps with the problem in hand.)
For this problem, what would be a trivial version? Well, suppose the number of days was 1. Then it would be easy: I stop the machine, and treat as many patients as I can. There's no point doing anything else.
Now, if we consider the number of days left as our notion of size, we get an answer to the second question as well. Suppose we know all answers to all problems where there are n days left. Let's write maxTreat(days, running) for the maximum number we could treat if there were days days left, and if the machine had initially been running for running days.
Now there are n+1 days left; and the machine has been running so far for k days. We've got two options: (1) stop the machine; (2) don't stop it. If we stop the machine, we can treat some patients today (we can work out the number based on k), and thereafter we can treat maxTreat(n, 1) patients, because there are then n days left, and by tomorrow the machine will have been running again for just one day. If we don't stop the machine, we can't treat anyone today, but thereafter we'll be able to treat maxTreat(n,k+1) patients, because by tomorrow the machine will have been running for k+1 days.
I will leave you to work out the precise details, but to solve it efficiently, we create a multidimensional array, based on number of days left, and number of days for which the machine has been running so far. We then iterate through the array, solving all the possible problems, starting from the trivial (one day left) and working backwards (two days, then three days, and so on). At each stage, the problem we're solving is either trivial (so we can just write the answer in), or something we can calculate from entries we wrote into the array at the previous step.
The really cool thing about dynamic programming is that we're creating a cache of all the results as we go. So for problems where a recursive approach would otherwise end up needing to calculate the answer to a sub-problem several times, with dynamic programming we never end up solving a sub-problem more than once.
Additional comments now that I've seen your implementation:
For one, I'm not too surprised that it starts to slow down when you hit 10,000 or so. The algorithm is O(n^2), because at each iteration you have to fill the array with up to n entries before you can move to the next level. I'm quite certain that O(n^2) is the best asymptotic complexity you're going to get for this puzzle, though.
If you want to speed it up further, you could look at a top-down approach. Currently you're doing bottom-up dynamic programming: solving the cases of size 0, then of size 1, and so on. But you can also do it the other way round. Essentially this is the same algorithm as if you were writing a horribly inefficient recursive solution that calculates solutions to sub-problems on the fly, except that you cache a result every time you calculate it. So it looks something like this:
Set up your two-dimensional array to hold solutions to sub-problems. Pre-fill it with -1 for each case. A value of -1 will indicate that you haven't solved that sub-problem yet.
Write a routine that solves maxTreat(days, running) in terms of answers to sub-problems at the next level down. When you want the answers to the sub-problems, look in the array. If there's a -1 in there, you haven't solved that one yet, so you recursively solve it, and then put the answer into the array. If there's anything other than -1, you can just use the value you find there, because you've already calculated it. (You can also use a HashMap instead of the multidimensional array.)
This is better in one way and worse in another. It's worse because you have overheads associated with the recursion, and because you'll eventually run out of stack with the recursive calls. You might need to bump up the stack size with a command-line parameter to the JVM.
But it's better in one key respect: you don't calculate answers to all sub-problems, but only the ones you need to know the answers to. For some problems, that's a massive difference, and for some, it's not. It's hard to get the right intuition, but I think it might make a big difference here. After all, each answer depends on only two sub-problems from the previous row.
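For illustration, here's a minimal memoized sketch of that top-down recurrence, using the example input from the question. The indexing convention (daysLeft includes today; running >= 1 is the days the machine has been running as of today) is my assumption, not the asker's code:

```java
public class PillScheduler {
    static int n = 5;
    static int[] machineRate = {1, 2, 4, 8, 16}; // example input from the question
    static int[] patients = {1, 2, 3, 0, 2};
    static Integer[][] memo = new Integer[n + 1][n + 1]; // null = not solved yet

    // daysLeft = days remaining including today; running = days the machine
    // has been running as of today (at least 1).
    static int maxTreat(int daysLeft, int running) {
        if (daysLeft == 0) return 0;
        if (memo[daysLeft][running] != null) return memo[daysLeft][running];
        int today = n - daysLeft; // index of the current day
        // Option 1: stop today, treat what we can, machine restarts tomorrow.
        int stop = Math.min(patients[today], machineRate[running - 1])
                 + maxTreat(daysLeft - 1, 1);
        // Option 2: let it run; nobody is treated today.
        int run = (running < n) ? maxTreat(daysLeft - 1, running + 1) : 0;
        return memo[daysLeft][running] = Math.max(stop, run);
    }

    public static void main(String[] args) {
        System.out.println(maxTreat(n, 1)); // 5 for the example input
    }
}
```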
The ultimate solution (don't try this until you get the top-down recursive one going first!) is to do it top-down but without recursion. This avoids the stack space issue. To do it, you create a stack of sub-problems that need solving (an ArrayDeque works well), push onto it the large problem for which you need a solution, and then iteratively pop problems off the top until the stack is empty. Pop one off, and call it P. Then:
Look in your array or HashMap to see if P has already been solved. If so, there is nothing more to do for it: move on to the next problem on the stack.
If not, look to see if the sub-problems for P have already been solved. If they have, then you can solve P, and you cache the answer. If the stack's now empty, then you've solved your final problem, and you output the answer for P.
If the sub-problems haven't all been solved, then push P back onto the stack. Then push any of P's sub-problems that haven't yet been solved onto the stack as well.
What will happen as you go is that your stack will grow initially as you push the main problem, and its sub-problems, and then its sub-problems, onto the stack. Then you'll start solving the smaller instances and putting the results into the cache, until eventually you have everything you need to solve the main problem.
It doesn't use significantly less memory than the recursive top-down approach, but it does use heap space rather than JVM stack space, and that means it scales up better because the JVM stack is much smaller than the heap.
This is quite difficult, though. At the very least, keep your working solution before you start coding up the more difficult version!
Another approach would be to predict the next day or days. Say we have seen 1, 2 patients on the last days: we could either take the two pills today and cure two patients, or predict three or more patients for the next day and let the machine run. If there is no rise, like 1, 1, we would predict one patient for tomorrow and take the one pill today. If the next day turns out different, like 1, 4, 0, we just adjust the prediction for the following day, i.e. to 2.
The upside of this solution is that it can work with uncertainty, i.e. you do not know what tomorrow brings; this allows us to stream the data.
The downside is that the first patient will always die.
I've implemented chiastic-security's design, but the performance isn't great when n gets larger than 10,000 or so. If anyone has any other ideas, please let me know, because I thought this was a pretty interesting problem. I tried it with recursion at first but kept running out of memory, so I had to do it in a loop instead. I was storing a big 2D array with all the results so far, but then I realised that I only ever need to access the previous "row" of results, so I'm only using two arrays, "current" and "previous":
static int calculateMax() {
    int[] previous = new int[n];
    for (int daysMachineRunning = 0; daysMachineRunning < n; daysMachineRunning++) {
        previous[daysMachineRunning] = treatPatients(0, daysMachineRunning);
    }
    int[] current = null;
    for (int daysRemaining = 1; daysRemaining < n; daysRemaining++) {
        current = new int[n - daysRemaining];
        for (int daysMachineRunning = 0; daysMachineRunning < n - daysRemaining; daysMachineRunning++) {
            current[daysMachineRunning] = Math.max(
                treatPatients(daysRemaining, daysMachineRunning) + previous[0],
                previous[daysMachineRunning + 1]
            );
        }
        previous = current;
    }
    return current[0];
}

static int treatPatients(int daysRemaining, int daysMachineRunning) {
    return Math.min(patients[n - 1 - daysRemaining], machineRate[daysMachineRunning]);
}
EDIT: I've now implemented the 2nd approach as well, but I'm still getting issues when n >= 10000 or so: Exception in thread "main" java.lang.OutOfMemoryError: Java heap space. Here's my code if anyone is interested in pursuing it further:
static final int[][] results = new int[n][n];
static final SortedSet<Target> queue = new TreeSet<>(new Comparator<Target>() {
    @Override
    public int compare(Target o1, Target o2) {
        if (o1.daysRemaining < o2.daysRemaining)
            return -1;
        else if (o1.daysRemaining > o2.daysRemaining)
            return 1;
        else if (o1.daysMachineRunning < o2.daysMachineRunning)
            return 1;
        else if (o1.daysMachineRunning > o2.daysMachineRunning)
            return -1;
        else
            return 0;
    }
});

public static void main(String[] args) {
    for (int i = 0; i < n; i++) {
        Arrays.fill(results[i], -1);
    }
    if (n <= 10) {
        System.out.println(Arrays.toString(machineRate));
        System.out.println(Arrays.toString(patients));
    } else
        System.out.println(n);
    System.out.println(calculateMax());
}

static class Target {
    int daysRemaining, daysMachineRunning;
    Target(int daysRemaining, int daysMachineRunning) {
        this.daysRemaining = daysRemaining;
        this.daysMachineRunning = daysMachineRunning;
    }
}

static int calculateMax() {
    addTarget(n - 1, 0);
    while (results[n - 1][0] == -1) {
        Target t = queue.first();
        queue.remove(t);
        calculateMax(t);
    }
    return results[n - 1][0];
}

static void calculateMax(Target t) {
    int daysRemaining = t.daysRemaining;
    int daysMachineRunning = t.daysMachineRunning;
    int treatedPatients = Math.min(patients[n - 1 - daysRemaining], machineRate[daysMachineRunning]);
    if (daysRemaining == 0)
        results[0][daysMachineRunning] = treatedPatients;
    else {
        int resultA = results[daysRemaining - 1][0];
        int resultB = results[daysRemaining - 1][daysMachineRunning + 1];
        if (resultA >= 0 && resultB >= 0) {
            results[daysRemaining][daysMachineRunning] = Math.max(treatedPatients + resultA, resultB);
        } else {
            if (resultA == -1)
                addTarget(daysRemaining - 1, 0);
            if (resultB == -1)
                addTarget(daysRemaining - 1, daysMachineRunning + 1);
            addTarget(daysRemaining, daysMachineRunning);
        }
    }
}

static void addTarget(int a, int b) {
    queue.add(new Target(a, b));
}
I was trying to solve a problem where a stream of numbers of length at most M is given. You don't know the exact length of the stream, but you are sure it won't exceed M. At the end of the stream, you have to report the N/2-th element, where N is the number of elements that came in the stream. What would be the best space complexity with which you can solve this problem?
My solution:
I think we can take a queue of size M/2, push two elements, then pop one element, and keep doing that until the stream is over. The N/2-th element will then be at the head of the queue. Time complexity will be O(n) whichever way you do it, but for this approach the space complexity is M/2. Is there any better solution?
I hope it is obvious that you will need at least N/2 memory (unless you can re-iterate through your stream, reading the same data again). Your algorithm uses M/2; given that N is upper-bounded by M, it would look like it doesn't matter which you choose, since N can go up to M.
But it doesn't have to. If N is much smaller than M (for example N=5 and M=1,000,000), then you would waste a lot of resources.
I would recommend some dynamically growing array structure, something like ArrayList, though that is not good for removing the first element.
Conclusion: You can have O(N) both time and memory complexity, and you can't get any better.
Friendly edit regarding ArrayList: adding an element to an ArrayList takes "amortized constant time", so adding N items is O(N) in time. Removing the first element, however, is linear (per the JavaDoc), so you can definitely get O(N) in time and space, but ONLY IF you don't remove anything. If you do remove, you still get O(N) in space (O(N/2) = O(N)), but your time complexity goes up.
Do you know the "tortoise and hare" algorithm? Start with two pointers to the beginning of the input. Then at each step advance the hare two elements and the tortoise one element. When the hare reaches the end of the input the tortoise is at the midpoint. This is O(n) time, since it visits each element of the input once, and O(1) space, since it keeps exactly two pointers regardless of the size of the input.
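A sketch of that idea in Java. Note one assumption: it needs the input to hand out two independent iterators (a linked list, a file with two read cursors, etc.); a strict one-shot stream can't do that, and then you're back to buffering:

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class Midpoint {
    // Tortoise-and-hare midpoint: O(n) time, O(1) extra space, given two
    // independent iterators over the same input. Returns the floor(N/2)-th
    // element (1-indexed), or null for inputs of fewer than two elements.
    static <T> T middle(Iterable<T> input) {
        Iterator<T> tortoise = input.iterator();
        Iterator<T> hare = input.iterator();
        T mid = null;
        while (hare.hasNext()) {
            hare.next();
            if (!hare.hasNext()) break; // odd length: hare hit the last element
            hare.next();                // hare advances two...
            mid = tortoise.next();      // ...tortoise advances one
        }
        return mid;
    }

    public static void main(String[] args) {
        List<Integer> data = Arrays.asList(1, 2, 3, 4, 5, 6);
        System.out.println(middle(data)); // 3
    }
}
```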
I have a Java collection of <String username, ArrayList loginTimes>. For example, one entry might look like ["smith", [2012-10-2 08:04:23, 2012-10-4 06:34:21]]. The times have one second resolution. I am looking to output a list of usernames for all users that have logged in at least twice in a period that is more than 24 hrs apart but less than 7 days apart.
There is a simple O(n^2) way to do this, where for a given user you compare each login time to every other login time and check whether they match the required conditions. There are also a few O(n log n) methods, such as storing the loginTimes in a binary search tree and, for each of the N login times, looking through the tree (log N) to see if there is another login time that matches the requirements.
My understanding is that there is also a solution (O(n) or better?) where you create a bit array (BitSet) from the login times, and use some sort of a mask to check for the required conditions (at least two login times 24 hrs apart but less than 7 days apart). Anybody know how this could be achieved? Or other possible efficient (O(n) or better) solutions?
You can do it in O(M * N log N), where M is the number of users (the size of the collection) and N is the average length of loginTimes (it's an array):
For every user in the collection do:
1- Sort the loginTimes list. This is an O(N log N) task.
2- Scan the sorted list and check whether your constraints hold for some pair. This can be done in O(N) time.
So, for every user the total cost is O(N) + O(N log N) = O(N log N).
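Those two steps can be sketched like this; the epoch-seconds representation, the two-pointer scan, and all names are my assumptions. After sorting, for each login it's enough to look at the first later login more than 24h away: if even that one is 7 days or more away, no later login can qualify for this starting point, and the second pointer never moves backwards, so the scan is linear:

```java
import java.util.Arrays;

public class LoginCheck {
    static final long DAY = 24L * 3600;  // seconds, matching the 1-second resolution
    static final long WEEK = 7 * DAY;

    // true if some pair of logins is more than 24h but less than 7 days apart.
    static boolean qualifies(long[] times) {
        long[] t = times.clone();
        Arrays.sort(t);                       // O(N log N)
        int j = 0;
        for (int i = 0; i < t.length; i++) {  // O(N): j only ever moves forward
            if (j < i + 1) j = i + 1;
            while (j < t.length && t[j] - t[i] <= DAY) j++; // first login > 24h after t[i]
            if (j < t.length && t[j] - t[i] < WEEK) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        long[] a = {0, 2 * DAY};              // 48h apart: qualifies
        long[] b = {0, DAY / 2, 10 * DAY};    // 12h and 10 days apart: does not
        System.out.println(qualifies(a) + " " + qualifies(b)); // true false
    }
}
```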