Algorithm to find peaks in 2D array - java

Let's say I have a 2D accumulator array in Java, int[][] array. Picture two example plots of such an array, where the x and z axes represent indexes in the array and the y axis represents values (the images showed an int[56][56] with values from 0 to ~4500).
What I need to do is find peaks in the array - there are 2 peaks in the first image and 8 peaks in the second. These peaks are always 'obvious' (there's always a gap between peaks), but they don't have to look like those in the images; they can be more or less random - the images are not based on the real data, just samples. The real array can have a size like 5000x5000, with peaks ranging from thousands to several hundred thousands... The algorithm has to be universal; I don't know how big the array or the peaks can be, and I also don't know how many peaks there are. But I do know a threshold of sorts - the peaks can't be smaller than a given value.
The problem is that one peak can consist of several smaller peaks nearby (first image), the height can be quite random, and the size can differ significantly within one array (by size I mean the number of units a peak takes up in the array - one peak can consist of 6 units and another of 90). It also has to be fast (all done in 1 iteration), since the array can be really big.
Any help is appreciated - I don't expect code from you, just the right idea :) Thanks!
edit: You asked about the domain - but it's quite complicated and imho it can't help with the problem. It's actually an array of ArrayLists of 3D points, like ArrayList<Point3D>[][], and the value in question is the size of the ArrayList. Each peak contains points that belong to one cluster (a plane, in this case) - this array is the result of an algorithm that segments a point cloud. I need to find the highest value in the peak so I can fit the points from the 'biggest' ArrayList to a plane, compute some parameters from it and then properly cluster most of the points from the peak.

He's not interested in estimating the global maximum using some sort of optimization heuristic - he just wants to find the maximum values within each of a number of separate clusters.
These peaks are always 'obvious' (there's always a gap between peaks)
Based on your images, I assume you mean there's always some 0-values separating clusters? If that's the case, you can use a simple flood-fill to identify the clusters. You can also keep track of each cluster's maximum while doing the flood-fill, so you both identify the clusters and find their maximum simultaneously.
This is also as fast as you can get, without relying on heuristics (which could return the wrong answer), since the maximum of each cluster could potentially be any value in the cluster, so you have to check them all at least once.
Note that this will iterate through every item in the array. This is also necessary, since (from the information you've given us) it's potentially possible for any single item in the array to be its own cluster (which would also make it a peak). With around 25 million items in the array, this should only take a few seconds on a modern computer.
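A minimal sketch of that idea, assuming clusters are separated by cells at or below the threshold and using 4-connectivity (both assumptions on my part, and all names are mine):

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

public class PeakFinder {

    /** Returns the maximum value of each connected cluster of cells above the threshold. */
    static List<Integer> clusterMaxima(int[][] grid, int threshold) {
        int rows = grid.length, cols = grid[0].length;
        boolean[][] visited = new boolean[rows][cols];
        int[][] neighbours = {{1, 0}, {-1, 0}, {0, 1}, {0, -1}};
        List<Integer> peaks = new ArrayList<>();

        for (int r = 0; r < rows; r++) {
            for (int c = 0; c < cols; c++) {
                if (visited[r][c] || grid[r][c] <= threshold) continue;

                // Iterative flood fill over one cluster, tracking its maximum as we go.
                int max = Integer.MIN_VALUE;
                Deque<int[]> stack = new ArrayDeque<>();
                stack.push(new int[]{r, c});
                visited[r][c] = true;
                while (!stack.isEmpty()) {
                    int[] cell = stack.pop();
                    max = Math.max(max, grid[cell[0]][cell[1]]);
                    for (int[] d : neighbours) {
                        int nr = cell[0] + d[0], nc = cell[1] + d[1];
                        if (nr >= 0 && nr < rows && nc >= 0 && nc < cols
                                && !visited[nr][nc] && grid[nr][nc] > threshold) {
                            visited[nr][nc] = true;
                            stack.push(new int[]{nr, nc});
                        }
                    }
                }
                peaks.add(max);  // one peak value per cluster
            }
        }
        return peaks;
    }
}
```

Using an explicit stack instead of recursion avoids a StackOverflowError on large clusters in a 5000x5000 array.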

This might not be an optimal solution, but since the problem sounds somewhat fluid too, I'll write it down.
1. Construct a list of all the values (and coordinates) that are over your minimum threshold.
2. Sort it in descending order of height.
3. The first element will be the biggest peak; add it to the peak list.
4. Then descend down the list: if the current element is further than the minimum distance from all the existing peaks, add it to the peak list.
This is a linear description but all the steps (except 3) can be trivially parallelised. In step 4 you can also use a coverage map: a 2D array of booleans that show which coordinates have been "covered" by a nearby peak.
(Caveat emptor: once you refine the criteria, this solution might become completely unfeasible, but in general it works.)
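A rough sketch of steps 1-4 above, assuming a squared-Euclidean "minimum distance" test and leaving out the coverage-map optimisation (the class, the Candidate record and the minDistance parameter are my own naming):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class GreedyPeaks {

    record Candidate(int row, int col, int value) {}

    /** Picks peaks greedily: tallest first, then any candidate far enough from all chosen peaks. */
    static List<Candidate> findPeaks(int[][] grid, int threshold, int minDistance) {
        // Step 1: collect everything above the threshold.
        List<Candidate> candidates = new ArrayList<>();
        for (int r = 0; r < grid.length; r++)
            for (int c = 0; c < grid[r].length; c++)
                if (grid[r][c] >= threshold)
                    candidates.add(new Candidate(r, c, grid[r][c]));

        // Step 2: sort in descending order of height.
        candidates.sort(Comparator.comparingInt(Candidate::value).reversed());

        // Steps 3-4: accept a candidate only if it is far from every peak chosen so far.
        List<Candidate> peaks = new ArrayList<>();
        long minDistSq = (long) minDistance * minDistance;
        for (Candidate cand : candidates) {
            boolean farEnough = true;
            for (Candidate p : peaks) {
                long dr = cand.row() - p.row(), dc = cand.col() - p.col();
                if (dr * dr + dc * dc < minDistSq) { farEnough = false; break; }
            }
            if (farEnough) peaks.add(cand);
        }
        return peaks;
    }
}
```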

Simulated annealing or hill climbing are what immediately come to mind. These algorithms, though, will not guarantee that all peaks are found.
However, if your "peaks" are separated by values of 0 as the gap, maybe a connected-components analysis would help. You would label a region as "connected" if it is connected through values greater than 0 (or, if you have a certain threshold, label regions as connected that are over that threshold); then your number of components would be your number of peaks. You could also then do another pass over the array to find the max of each component.
I should note that connected components can be done in linear time, and finding the peak values can also be done in linear time.

Related

Efficiently gathering data from a game board

Say I have a Connect 4 board; it's a 7x6 board, and I want to store which piece is in which spot on that board. Using a 2D array would be nice, since I can quickly visualize it as a board, but I worry about the efficiency of looping through an array to gather data so often.
What would be the most efficient way of 1) Storing that game board and 2) Gathering the data from the said game board?
Thanks.
The trite answer is that at 7x6, it's not going to be a big deal: Unless you're on a microcontroller this might not make a practical difference. However, if you're thinking about this from an algorithm perspective, you can reason about the operations; "storing" and "gathering" are not quite specific enough. You'll need to think through exactly which operations you're trying to support, and how they would scale if you had thousands of columns and millions of pieces. Operations might be:
Read whether a piece exists, and what color it is, given its x and y coordinates.
When you add a piece to a column, it will "fall". Given a column, how far does it fall, or what would the new y value be for a piece added to column x? At most this will be height times whatever the cost of reading is, since you could just scan the column.
Add a piece at the given x and y coordinate.
Scan through all pieces, which is at most width times height times the cost of reading.
Of course, all of this has to fit on your computer as well, so you care about storage space as well as time.
Let's list out some options:
Array, such as game[x][y] or game[n] where n is something like x * height + y: Constant time (O(1)) to read/write given x and y, but O(width * height) to scan and count, and O(height) time to figure out how far a piece drops. Constant space of O(width * height). Perfectly reasonable for 7x6, might be a bad idea if you had a huge grid (e.g. 7 million x 6 million).
Array such as game[n] where each piece is added to the board and each piece contains its x and y coordinate: O(pieces) time to find/add/delete a piece given x and y, O(pieces) scan time, O(pieces) space. Probably good for an extremely sparse grid (e.g. 7 million x 6 million), but needlessly slow for 7x6.
HashMap as Grant suggests, where the key is a Point data object you write that contains x and y. O(1) to read/write, O(height) to see how far a piece drops, O(pieces) time to scan, O(pieces) space. Slightly better than an array, because you don't need an empty array slot per blank space on the board. There's a little extra memory per piece entry for the HashMap key object, but you could effectively make an enormous board with very little extra cost, which makes this slightly better than option 1 if you don't mind writing the extra Point class.
An array of resizable column lists, e.g. an array where each column is a List. This is similar to an array of fixed arrays, but because a List stores its size and can allocate only as much memory as needed, you can store the state very efficiently, including how far a piece needs to fall. Constant read/write/add, constant "fall" time, O(pieces) + O(width) scan time, O(pieces) + O(width) space, because you don't need to scan/store the cells you know are empty.
Given those options, I think that an array of Lists (#4) is the most scalable solution, but unless I knew it needed to scale I would probably choose the array of arrays (#1) for ease of writing and understanding.
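A small sketch of option #4 (the class, enum and method names are my own choices):

```java
import java.util.ArrayList;
import java.util.List;

public class Connect4Board {

    enum Piece { RED, YELLOW }

    private final List<List<Piece>> columns;   // columns.get(x).get(y) is the piece at (x, y)
    private final int height;

    Connect4Board(int width, int height) {
        this.height = height;
        columns = new ArrayList<>(width);
        for (int x = 0; x < width; x++) columns.add(new ArrayList<>());
    }

    /** Drops a piece into column x; it lands on top of whatever is already there. O(1). */
    int drop(int x, Piece piece) {
        List<Piece> column = columns.get(x);
        if (column.size() >= height) throw new IllegalStateException("column full");
        column.add(piece);
        return column.size() - 1;   // the y position the piece landed at
    }

    /** Reads the piece at (x, y), or null if the cell is empty. O(1). */
    Piece get(int x, int y) {
        List<Piece> column = columns.get(x);
        return y < column.size() ? column.get(y) : null;
    }
}
```

Note that the "fall" question answers itself: the size of a column list is exactly the next free y position.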
I may be wrong, but I think you're looking for a HashMap (a form of hash table) if you want efficiency.
Here's the documentation:
https://docs.oracle.com/javase/8/docs/api/java/util/Hashtable.html
HashMap provides expected constant-time performance, O(1), for most operations like put(), get(), remove() and containsKey().
Since you're using a 7x6 board, you can simply name your keys something like A1 ... A6, for example, with the piece as the value.
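A minimal sketch of that idea (the key scheme and the String value type are just one possible choice, not something from the question):

```java
import java.util.HashMap;
import java.util.Map;

public class HashMapBoard {
    // Keys like "A1" .. "G6" (column letter + row number); values are the piece colors.
    private final Map<String, String> board = new HashMap<>();

    void place(char column, int row, String color) {
        board.put("" + column + row, color);          // O(1) expected
    }

    String pieceAt(char column, int row) {
        return board.get("" + column + row);          // null if the cell is empty
    }
}
```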

25th, 50th and 75th percentile of a data structure of size 360 000

My program listens for incoming data, and an estimated 5 data points come in every second. All the data will be stored in a data structure. When the data structure reaches a size of 360 000, I will need to find the 25th, 50th and 75th percentiles among the data stored.
Which of the following would be more efficient? Or if you know a better method please help me out.
Should I use an order statistics tree?
Insert, delete (log n).
Or should I wait till it has collected all 360 000 data points, then sort them and find the 25th, 50th and 75th percentiles from there?
You could use a selection algorithm (quickselect) to find the different percentiles.
In your problem you know you need to find the 90k, 180k, and 270k positioned elements in a sorted list.
Once all 360k elements are fetched, choose a random element and split the elements into sublists of those smaller than, equal to, and bigger than the element you chose.
After that step, you will know what position the element you chose ended up at. Then you can do the same with either the smaller or the bigger sublist, depending on which percentile you are looking for.
In the best case, this could be solved in O(n), as you could hit the right percentiles on the first go, but this is very unlikely.
In the worst case, you could always choose the smallest element, and therefore do O(n) passes, making it O(n^2), but that's very unlikely too.
Luckily, the expected running time turns out to be T(n) <= 8n, which is linear running time.
As a tip, you could track the min/max while the data is streaming in, and then choose the first pivot as (max+min)/2. This of course assumes that the numbers are distributed somewhat like a normal distribution and are not totally skewed.
If you need more details on the algorithm, have a look here: http://cseweb.ucsd.edu/~dasgupta/103/4a.pdf
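A minimal sketch of quickselect with random pivots (the class and method names are mine; the (max+min)/2 tip above could be substituted for the first pivot choice):

```java
import java.util.Random;

public class Percentiles {
    private static final Random RANDOM = new Random();

    /** Returns the k-th smallest element (0-based) using quickselect; expected linear time. */
    static double quickselect(double[] data, int lo, int hi, int k) {
        while (true) {
            if (lo == hi) return data[lo];
            int pivotIndex = lo + RANDOM.nextInt(hi - lo + 1);
            pivotIndex = partition(data, lo, hi, pivotIndex);
            if (k == pivotIndex) return data[k];
            else if (k < pivotIndex) hi = pivotIndex - 1;   // percentile is in the smaller part
            else lo = pivotIndex + 1;                        // percentile is in the bigger part
        }
    }

    private static int partition(double[] data, int lo, int hi, int pivotIndex) {
        double pivot = data[pivotIndex];
        swap(data, pivotIndex, hi);
        int store = lo;
        for (int i = lo; i < hi; i++) {
            if (data[i] < pivot) swap(data, i, store++);
        }
        swap(data, store, hi);
        return store;
    }

    private static void swap(double[] data, int i, int j) {
        double t = data[i]; data[i] = data[j]; data[j] = t;
    }

    public static void main(String[] args) {
        double[] samples = new double[360_000];
        for (int i = 0; i < samples.length; i++) samples[i] = RANDOM.nextDouble();
        // 25th, 50th and 75th percentiles of 360 000 values sit at positions 90 000, 180 000, 270 000.
        System.out.println(quickselect(samples.clone(), 0, samples.length - 1, 90_000 - 1));
        System.out.println(quickselect(samples.clone(), 0, samples.length - 1, 180_000 - 1));
        System.out.println(quickselect(samples.clone(), 0, samples.length - 1, 270_000 - 1));
    }
}
```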

Algorithm for clustering Tweets based on geo radius

I want to cluster tweets based on a specified geo-radius like (10 meters). If for example I specify 10 meters radius, then I want all tweets that are within 10 meters to be in one cluster.
A simple algorithm could be to calculate the distance between each tweet and each other tweets, but that would be very computationally expensive. Are there better algorithms to do this?
You can organize your tweets in a quadtree. This makes it quite easy to find tweets nearby without looking at all tweets and their locations.
The quadtree does not directly give you the distance (because it is based on a Manhattan-style split of the coordinates), but it gives you nearby tweets, for which you can calculate the precise distance afterwards.
If your problem is only in computation of distances:
remember: you should never compute actual distances if you only need them for comparison. Use their squares instead.
Do not compare:
sqrt((x1-x2)^2+(y1-y2)^2) against 10
compare instead
(x1-x2)^2+(y1-y2)^2 against 100
It takes far less time.
The other improvement is to simply compare coordinates before comparing squares of distances. If abs(x1-x2) is greater than the radius (10 here), you can drop that pair right away. (This is the Manhattan distance MrSmith is speaking about.)
I don't know how you work with your points, but if their set is stable, you could make two arrays of them, and in each one order them according to one of the coordinates. After that you need to check only these points that are close to the source one in both arrays.
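A small sketch of those two checks, assuming the coordinates have already been projected to metres (tweet latitude/longitude would need to be converted first):

```java
public class GeoCluster {

    /** True if two points (in projected metres) are within `radius` metres, without taking a square root. */
    static boolean withinRadius(double x1, double y1, double x2, double y2, double radius) {
        double dx = Math.abs(x1 - x2);
        double dy = Math.abs(y1 - y2);
        // Cheap rejection on individual coordinates first (the "Manhattan" shortcut).
        if (dx > radius || dy > radius) return false;
        // Compare squared distances; no sqrt needed.
        return dx * dx + dy * dy <= radius * radius;
    }
}
```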

Java data structure of 500 million (double) values?

I am generating random edges for a complete graph with 32678 Vertices. So, 500 million + values.
I am using a HashMap with the edges as keys and the random edge weights as values. I keep encountering:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at java.lang.StringBuilder.toString(StringBuilder.java:430)
    at pa1.Graph.<init>(Graph.java:60)
    at pa1.Main.main(Main.java:19)
This graph will then be used to construct a Minimum Spanning Tree.
Any ideas on a better data-structure or approach?
I know there are overrides to allocate more memory, but I would prefer a solution that works as-is.
A HashMap will be very large, because it will contain Doubles (with a capital D), which are significantly larger than 8 bytes (not to mention the Entry objects). It depends on the implementation and the CPU, but I think it's at least 16 bytes each, and probably more.
I think you should consider keeping the primary data in a huge double[] (or, if you can spare some accuracy, a float[]). That cuts memory usage by an easy 2x or 4x. (500M floats is a "mere" 2GB) Then use integer indexes into this array to implement your edges and vertices. For example, an edge could be an int[2]. This is far from O-O, and there's some serious hand-waving here. (and I don't understand all the nuances of what you are trying to do)
Very "old fashioned" in style, but requires a lot less memory.
Correction - I think an edge might be int[4], a vertex an int[2]. But you get the idea. Actually, for edges and vertices, you will have a smaller number of Objects and for them you can probably use "real" Objects, Maps, etc...
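A sketch of the flat-array idea for the edge weights of a complete undirected graph, packing the upper triangle of the weight matrix into one float[] (the class and its index formula are my own, not from the question):

```java
public class CompleteGraphWeights {

    private final int n;           // number of vertices
    private final float[] weights; // one weight per undirected edge, packed row by row

    CompleteGraphWeights(int n) {
        this.n = n;
        this.weights = new float[(int) ((long) n * (n - 1) / 2)];
    }

    /** Maps the edge {i, j} (i != j) to its slot in the packed upper-triangular array. */
    private int index(int i, int j) {
        if (i > j) { int t = i; i = j; j = t; }   // normalise so that i < j
        return (int) ((long) i * n - (long) i * (i + 1) / 2 + j - i - 1);
    }

    void setWeight(int i, int j, float w) { weights[index(i, j)] = w; }

    float getWeight(int i, int j) { return weights[index(i, j)]; }
}
```

For 32678 vertices that is roughly 534 million floats, about 2 GB, which still needs a sizable heap but avoids a boxed Double plus an Entry object per edge.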
Since it is a complete graph, there is no doubt about what the edges are. How about storing the weights for those edges in a simple list which is ordered in a certain manner? So e.g. if you have 5 nodes, the edges would be ordered as follows: {1,2}, {1,3}, {1,4}, {1,5}, {2,3}, {2,4}, {2,5}, {3,4}, {3,5}, {4,5}.
However, as pointed out by @BillyO'Neal, this might still take up 8 GB of space. You might want to split this list into multiple files and simultaneously maintain an index of these files indicating where one set of weights ends in one file and where the next set begins.
Additionally, given that you are finding the MST for the graph, you might want to have a look at the following paper as well: http://cvit.iiit.ac.in/papers/Vibhav09Fast.pdf. The paper seems to be based on Borůvka's algorithm (http://en.wikipedia.org/wiki/Bor%C5%AFvka's_algorithm; http://iss.ices.utexas.edu/?p=projects/galois/benchmarks/mst).

Efficiently counting co-occurrences in a large dataset

Came across this interview programming test recently:
You're given a list of top 50 favorite artists for 1000 users (from last.fm)
Generate a list of all artist pairs that appear together at least 50 times.
The solution can't store all possible pairs in memory, or evaluate all possible pairs.
The solution should be scalable to larger datasets.
The solution doesn't have to be exact, ie you can report pairs with a high probability of meeting the cutoff.
I feel I have a pretty workable solution, but I'm wondering if they were looking for something specific that I missed.
(In case it makes a difference - this isn't from my own interviewing, so I'm not trying to cheat any prospective employers)
Here are my assumptions:
There's a finite maximum number of artists (622K according to MusicBrainz), while there is no limit on the number of users (well, not more than ~7 billion, I guess).
Artists follow a "long tail" distribution: a few are popular, but most are favorited by a very small number of users.
The cutoff is chosen to select a certain percentage of artists (around 1% with 50 and the given data), so it will increase as the number of users increases.
The third requirement is a little vague - technically, if you have any exact solution you've "evaluated all possible pairs".
Practical Solution
first pass: convert artist names to numeric ids; store converted favorite data in a temp file; keep count of user favorites for each artist.
Requires a string->int map to keep track of assigned ids; can use a Patricia tree if space is more important than speed (needed 1/5th the space and twice the time in my, admittedly not very rigorous, tests).
second pass: iterate over the temp file; throw out artists which didn't, individually, meet the cutoff; keep counts of pairs in a 2d matrix.
Will require n(n-1)/2 bytes (or shorts, or ints, depending on the data size) plus the array reference overhead. Shouldn't be a problem since n is, at most, 0.01-0.05 of 622K.
This seems like it can process any sized real-world dataset using less than 100MB of memory.
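For concreteness, a sketch of the pair-counting part of the second pass (the triangular array layout and the `survivors` mapping are my own naming: `survivors` maps an original artist id to a compact id 0..m-1 if that artist met the individual cutoff, or -1 otherwise):

```java
import java.util.List;

public class PairCounter {

    /** Counts co-occurrences of the surviving artists across all users' favorite lists. */
    static int[][] countPairs(List<int[]> userFavorites, int[] survivors, int m) {
        int[][] counts = new int[m][];            // triangular matrix: counts[a] covers pairs (a, b) with b > a
        for (int a = 0; a < m; a++) counts[a] = new int[m - a - 1];

        for (int[] favorites : userFavorites) {
            for (int x = 0; x < favorites.length; x++) {
                int a = survivors[favorites[x]];
                if (a < 0) continue;              // artist didn't meet the individual cutoff
                for (int y = x + 1; y < favorites.length; y++) {
                    int b = survivors[favorites[y]];
                    if (b < 0 || a == b) continue;
                    int lo = Math.min(a, b), hi = Math.max(a, b);
                    counts[lo][hi - lo - 1]++;    // pair (lo, hi) stored once, above the diagonal
                }
            }
        }
        return counts;
    }
}
```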
Alternate Solution
If you can't do multiple passes (for whatever contrived reason), use an array of Bloom filters to keep the pair counts: for each pair you encounter, find the highest filter it's (probably) in, and add it to the next one up. So the first time it's added to bf[0], the second time to bf[1], and so on until bf[49]. Or you can revert to keeping actual counts after a certain point.
I haven't run the numbers, but the lowest few filters will be quite sizable - it's not my favorite solution, but it could work.
Any other ideas?
You should consider one of the existing approaches for mining association rules. This is a well-researched problem, and it is unlikely that a home-grown solution would be much faster. Some references:
Wikipedia has a non-terrible list of implementations http://en.wikipedia.org/wiki/Association_rule_learning .
Citing a previous answer of mine: What is the most efficient way to access particular elements in a SortedSet? .
There is a repository of existing implementations here: http://fimi.ua.ac.be/src/ . These are tools that participated in a performance competition a few years back; many of them come with indicative papers to explain how/when/why they are faster than other algorithms.
With two points of the requirement being about inexact solution, I'm guessing they're looking for a fast shortcut approximation instead of an exhaustive search. So here's my idea:
Suppose that there is absolutely no correlation between a fan's choices of favorite artists. This is, of course, surely false. Someone who likes Rembrandt is far more likely to also like Rubens than he is to also like Pollock. (You did say we were picking favorite artists, right?) I'll get back to this in a moment.
Then make a pass through the data, counting the number of distinct artists, the number of fans, and how often each artist shows up as a favorite. When we're done making this pass: (1) Eliminate any artists who don't individually show up at least as many times as the pair cutoff. If an artist only shows up 40 times, no pair containing him can show up more than 40 times. (2) For the remaining artists, convert each "like count" to a percentage, i.e. this artist was liked by, say, 10% of the fans. Then for each pair of artists, multiply their like percentages together and then multiply by the total number of fans. This is the estimated number of times they'd show up as a pair.
For example, suppose that of 1000 fans, 200 say they like Rembrandt and 100 say they like Michelangelo. That means 20% for Rembrandt and 10% for Michelangelo. So if there's no correlation, we'd estimate that 20% * 10% * 1000 = 20 like both. This is below the threshold, so we wouldn't include this pair.
The catch is that there almost surely is a correlation between "likes". My first thought would be to study some real data and see how much of a correlation there is, that is, how the real pair counts differ from the estimates. If we find that, say, the real count is rarely more than twice the estimated count, then we could declare any pair whose estimate is over 1/2 of the threshold a "candidate". Then we do an exhaustive count on the candidates to see how many really meet the condition. This lets us eliminate all the pairs that fall well below the threshold as "unlikely" and thus not worth the cost of investigating.
This could miss pairs where the artists almost always occur together. If, say, 100 fans like Picasso, 60 like Van Gogh, and of the 60 who like Van Gogh 50 also like Picasso, their estimate will be much lower than the actual count. If this happens rarely enough, it may fall into the acceptable "exact answer not required" category. If it happens all the time, this approach won't work.
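A tiny sketch of that estimate and the candidate test (the fudge factor of 2 and all names are illustrative assumptions, not from the answer):

```java
public class PairEstimate {

    /** Estimated co-occurrence count under the independence assumption. */
    static double estimatePairCount(int likesA, int likesB, int totalFans) {
        double pA = (double) likesA / totalFans;
        double pB = (double) likesB / totalFans;
        return pA * pB * totalFans;
    }

    public static void main(String[] args) {
        // The Rembrandt/Michelangelo example: 200 and 100 likes out of 1000 fans.
        double estimate = estimatePairCount(200, 100, 1000);   // 0.2 * 0.1 * 1000 = 20
        // With an assumed correlation factor of at most 2, anything above half the cutoff of 50 is a candidate.
        boolean candidate = estimate * 2 >= 50;
        System.out.println(estimate + " -> candidate: " + candidate);
    }
}
```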
