OK, so I reformatted the post to make it a little easier to understand (sorry about all the pastebins, but Stack Overflow was being difficult with code formatting).
Please note that I do not intend to have the ridiculous amount of data stored as I state below.
The main reason I use the amount I said is to squeeze out as much efficiency as possible.
Let's say I have the following code.
The method that adds to the DropItemQueue (it starts at floodFill with depth 0; the other parameters do not matter):
http://pastebin.com/3wqEb5cM
This is in the same class, and it then calls the dropItem method in Utils:
http://pastebin.com/yEUW9Wad
My Utils.dropItem method is as follows
http://pastebin.com/c0eaWeMA
This is the ServerTickHandler.addDropItemQueue method and its variable storage
http://pastebin.com/Q4p5a4ja
Here is the DropItemQueue class
http://pastebin.com/wxCj9imN
If I were to add, say, 100,000,000 elements to this HashSet, I have noticed that it takes around 2 seconds to iterate over everything in it using the code below.
This iteration runs every 1/20th of a second.
http://pastebin.com/zSWg1kKu
2 seconds per iteration doesn't seem like much, but with that many elements, getting rid of every single element stored in the HashSet would take around 50 days.
Every time an element is added to the HashSet, its maxTicks is 1 more than that of the previously added element, so basically an item is dropped every second. But because of the 2 seconds it takes to iterate over everything, it actually takes 3 seconds to drop an item, which would make it take around 150 days to complete the iteration and flush every element out.
My question is: would it be quicker to have multiple HashSets, each with a smaller maximum size, say 1,000 elements?
Yes, that would give me 100,000 HashSets, but since each one is smaller, would the iteration times be shorter (albeit a very small gain in efficiency)? Or is there something better I can use than a HashSet, or are HashSets the best thing to use here?
Note that if I did split the work into many smaller iterations, I could not use threads because of cross-thread data consistency issues.
The HashSet data structure is designed for essentially one purpose: answering the question "Does this set contain this item?" For any other use it is not necessarily the most efficient choice.
For your use, it seems like a Queue would be a better choice.
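A minimal sketch of that idea, assuming each queued drop records the server tick at which it becomes due. QueuedDrop, DropScheduler and the field names are made-up stand-ins for the pastebin code, not the poster's actual classes:

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Hypothetical stand-in for a DropItemQueue entry: it records the server
// tick at which the item should be dropped.
class QueuedDrop {
    final long dueTick;
    final Object item;           // whatever payload the real drop logic needs

    QueuedDrop(long dueTick, Object item) {
        this.dueTick = dueTick;
        this.item = item;
    }
}

class DropScheduler {
    // Entries are enqueued with strictly increasing dueTick values,
    // so a plain FIFO queue keeps them in due order automatically.
    private final Queue<QueuedDrop> pending = new ArrayDeque<>();

    void enqueue(QueuedDrop drop) {
        pending.add(drop);
    }

    // Called every server tick (1/20th of a second). Only the head of the
    // queue is inspected, so the cost per tick is proportional to the items
    // due this tick, not to the total number of items queued.
    void onTick(long currentTick) {
        while (!pending.isEmpty() && pending.peek().dueTick <= currentTick) {
            QueuedDrop drop = pending.poll();
            // Utils.dropItem(drop.item);  // hand off to the real drop logic
        }
    }
}
```

Since the due ticks only ever increase, polling the head each tick replaces the full iteration over the set.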
Related
I did a small proof of concept and found that searching in one large Set of 400 items is 6-7 times faster than searching in 20 sets of 20 items each. Hashing is used in both cases, so how does just looping over the sets cost so much?
Would you expect it to take the same time or 20 times longer? With 20 sets, you need 10.5 lookups on average (assuming the item is present in exactly one of them, you check (20+1)/2 = 10.5 sets before hitting the right one), so a factor of 10.5 should result. This is reasonably close to your reported factor of 6-7. As you gave us no code, we can't point to where your benchmark fails. But without reading something about how to benchmark, nobody gets it right.
If you want to know more, provide us with more details.
PS: You should hardly ever use 20 sets the way you're probably using them. A Map<Item, Integer> is a much better representation of a set partitioning and is as fast as a Set<Item> (in fact, a Set is implemented via a Map).
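A minimal sketch of that representation, assuming the 20 sets really are a partition of the items into 20 groups (the item type and names are illustrative):

```java
import java.util.HashMap;
import java.util.Map;

public class PartitionExample {
    public static void main(String[] args) {
        // Instead of 20 separate HashSet<String>s, map each item directly to
        // the index of the partition it belongs to. Membership and "which
        // group is it in?" are then a single hash lookup each.
        Map<String, Integer> partitionOf = new HashMap<>();
        partitionOf.put("itemA", 0);
        partitionOf.put("itemB", 7);

        Integer group = partitionOf.get("itemA");          // null means "not in any set"
        boolean contained = partitionOf.containsKey("itemB");

        System.out.println(group + " " + contained);
    }
}
```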
As an assignment, I implemented a custom data structure and a few test cases in order to make sure it works properly.
The code itself is not really needed for the question but you can assume it is some sort of a SortedList.
My problem is that I was also asked to test the big-O complexity, e.g. to make sure put() is O(n), etc.
I am having a lot of trouble understanding how I can write such a test.
One way that came to mind was to count the number of iterations inside the put() method with a simple counter and then check that it equals the size of the list. But that would require me to change the code of the list itself to count the exact number, and I would much rather do it properly, from outside the class, holding only an instance of it.
Any ideas? I would really appreciate the help!
With unit testing you test the interface of a class, but the number of iterations is not part of the interface here. You could use a timer to check the runtime behaviour with different sizes. If it's O(n) there should be a linear dependency between time and n.
Get the current time at the start of the test, call the method that you're testing a large number of times, get the current time after, and check the difference to get the execution time. Repeat with a different size input, and check that the execution time follows the expected relationship.
For an O(n) algorithm, a suitable check is that doubling the input size approximately doubles the execution time.
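A hedged sketch of such a timing check. The SortedList below is a trivial stand-in for the assignment's class (assumed to have a put(int) method); only the measurement pattern matters:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Stand-in for the assignment's class: an insertion-sorted list whose put() is O(n).
class SortedList {
    private final List<Integer> items = new ArrayList<>();

    void put(int value) {
        int i = 0;
        while (i < items.size() && items.get(i) < value) {
            i++;                        // linear scan: this is what makes put() O(n)
        }
        items.add(i, value);
    }
}

public class PutComplexityCheck {

    public static void main(String[] args) {
        long timeSmall = timeBatchOfPuts(20_000);
        long timeLarge = timeBatchOfPuts(40_000);

        // For an O(n) put(), doubling the list size should roughly double the
        // time of a fixed batch of puts. Allow generous slack: timer precision,
        // GC and JIT warm-up make exact ratios unrealistic.
        double ratio = (double) timeLarge / timeSmall;
        System.out.printf("ratio = %.2f (expect roughly 2 for an O(n) put)%n", ratio);
    }

    // Builds a list of size n, then times a fixed batch of 1,000 further puts.
    private static long timeBatchOfPuts(int n) {
        SortedList list = new SortedList();
        Random random = new Random(42);
        for (int i = 0; i < n; i++) {
            list.put(random.nextInt());
        }
        long start = System.nanoTime();
        for (int i = 0; i < 1_000; i++) {
            list.put(random.nextInt());
        }
        return System.nanoTime() - start;
    }
}
```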
My program listens for incoming data, and an estimated 5 data points arrive every second. All of the data will be stored in a data structure. When the data structure reaches a size of 360,000, I will need to find the 25th, 50th and 75th percentiles among the stored data.
Which of the following would be more efficient? Or if you know a better method please help me out.
Should I use an order statistics tree?
Insert and delete are O(log n).
Or should I wait until all 360,000 data points have been collected, then sort them and find the 25th, 50th and 75th percentiles from there?
You could use a selection algorithm (quickselect) to find the different percentiles.
In your problem, you know you need to find the elements at positions 90k, 180k, and 270k of the sorted list.
Once all 360k elements have been fetched, choose a random element and split the elements into sublists based on whether they are smaller than, equal to, or bigger than the element you chose.
After that step, you will be able to see at what position the chosen element ended up. Then you can do the same with either the smaller or the bigger sublist, depending on which percentile you are looking for.
In the best case, this could be solved in O(n), as you could hit the right percentiles on the first pass, but this is very unlikely.
In the worst case, you could always pick the smallest element, doing O(n) passes and therefore O(n^2) work overall, but that is very unlikely too.
Luckily, the expected running time turns out to be T(n) <= 8n, which is linear.
As a tip, you could track the min/max values while the data is streaming in, and then choose (max+min)/2 as the first pivot. This of course assumes the numbers are roughly similar to a normal distribution and not totally skewed.
If you need more details on the algorithm, have a look here: http://cseweb.ucsd.edu/~dasgupta/103/4a.pdf
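For illustration, a minimal quickselect sketch along those lines, using a textbook three-way partition (not taken from the linked notes; names are illustrative):

```java
import java.util.Random;

public class PercentileSelect {

    private static final Random RANDOM = new Random();

    // Returns the element that would sit at index k (0-based) if `values` were sorted.
    // Expected O(n) time; the array is partially reordered in place.
    static double select(double[] values, int k) {
        int lo = 0, hi = values.length - 1;
        while (true) {
            if (lo == hi) {
                return values[lo];
            }
            double pivot = values[lo + RANDOM.nextInt(hi - lo + 1)];
            // Three-way partition of [lo, hi]: [< pivot | == pivot | > pivot].
            int lt = lo, gt = hi, i = lo;
            while (i <= gt) {
                if (values[i] < pivot) {
                    swap(values, lt++, i++);
                } else if (values[i] > pivot) {
                    swap(values, i, gt--);
                } else {
                    i++;
                }
            }
            if (k < lt) {
                hi = lt - 1;          // answer lies in the "< pivot" block
            } else if (k > gt) {
                lo = gt + 1;          // answer lies in the "> pivot" block
            } else {
                return pivot;         // k falls inside the "== pivot" block
            }
        }
    }

    private static void swap(double[] a, int i, int j) {
        double tmp = a[i];
        a[i] = a[j];
        a[j] = tmp;
    }

    public static void main(String[] args) {
        double[] data = new double[360_000];
        for (int i = 0; i < data.length; i++) {
            data[i] = RANDOM.nextDouble();
        }
        // 25th, 50th and 75th percentiles of 360,000 values.
        System.out.println(select(data.clone(), 90_000 - 1));
        System.out.println(select(data.clone(), 180_000 - 1));
        System.out.println(select(data.clone(), 270_000 - 1));
    }
}
```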
I have several large double and long arrays of 100k values each that need to be accessed for computation at a given time. Even with largeHeap requested, the Android OS doesn't give me enough memory, and I keep getting OutOfMemory exceptions on most of the tested devices. So I went researching for ways to overcome this and, following an answer I got from Waldheinz to my previous question, I implemented an array backed by a file: I use a RandomAccessFile to get a channel to it, map it with a MappedByteBuffer as suggested, and read it through the MappedByteBuffer's asLongBuffer or asDoubleBuffer. This works perfectly; I have 100% eliminated the OutOfMemory exceptions. But the performance is very poor: many get(someIndex) calls take about 5-15 milliseconds each, so the user experience is ruined.
Some useful information:
I am using binary search on the arrays to find start and end indices, and then I run a linear loop from start to end.
I added a print statement for any get() call that takes more than 5 milliseconds to finish (printing the time it took, the requested index, and the last requested index). It seems that all of the binary-search get requests were printed, and a few of the linear requests were too.
Any suggestions on how to make it go faster?
Approach 1
Index your data - add pointers for quick searching
Split your sorted data into 1,000 buckets of 100 values each
Maintain an index referencing each bucket's start and end
The algorithm is first to find your bucket in this in-memory index (even a linear loop is fine for this) and then to jump to that bucket in the memory-mapped file
This results in a single jump into the file (a single bucket to locate) and an iteration over at most 100 elements.
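A rough sketch of Approach 1, assuming the memory-mapped file holds sorted longs (the class and method names are made up for illustration):

```java
import java.io.RandomAccessFile;
import java.nio.LongBuffer;
import java.nio.channels.FileChannel;

// In-memory index over a sorted, memory-mapped array of longs:
// one entry (the first value) per bucket of BUCKET_SIZE values.
public class BucketedLongArray {

    private static final int BUCKET_SIZE = 100;

    private final LongBuffer data;        // the memory-mapped values
    private final long[] bucketFirst;     // first value of each bucket, kept on the heap

    public BucketedLongArray(String path) throws Exception {
        LongBuffer mapped;
        try (RandomAccessFile file = new RandomAccessFile(path, "r");
             FileChannel channel = file.getChannel()) {
            // The mapping stays valid after the channel is closed.
            mapped = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size()).asLongBuffer();
        }
        this.data = mapped;
        int buckets = (data.limit() + BUCKET_SIZE - 1) / BUCKET_SIZE;
        this.bucketFirst = new long[buckets];
        for (int b = 0; b < buckets; b++) {
            bucketFirst[b] = data.get(b * BUCKET_SIZE);   // touches each page once, up front
        }
    }

    // Index of the first element >= key: a scan over the small heap-resident
    // index, then a short scan inside a single bucket of the mapped file.
    public int lowerBound(long key) {
        int bucket = 0;
        while (bucket + 1 < bucketFirst.length && bucketFirst[bucket + 1] <= key) {
            bucket++;
        }
        int start = bucket * BUCKET_SIZE;
        int end = Math.min(start + BUCKET_SIZE, data.limit());
        for (int i = start; i < end; i++) {
            if (data.get(i) >= key) {
                return i;
            }
        }
        return end;
    }
}
```

The point of the design is that the scattered, page-faulting reads of a binary search over the whole file are replaced by a search over a small heap array plus one contiguous bucket read.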
Approach 2
Use a lightweight embedded database, e.g. MapDB, which supports Android.
Came across this interview programming test recently:
You're given a list of top 50 favorite artists for 1000 users (from last.fm)
Generate a list of all artist pairs that appear together at least 50 times.
The solution can't store in memory, or evaluate, all possible pairs.
The solution should be scalable to larger datasets.
The solution doesn't have to be exact, ie you can report pairs with a high probability of meeting the cutoff.
I feel I have a pretty workable solution, but I'm wondering if they were looking for something specific that I missed.
(In case it makes a difference - this isn't from my own interviewing, so I'm not trying to cheat any prospective employers)
Here are my assumptions:
There's a finite maximum number of artists (622K according to MusicBrainz), while there is no limit on the number of users (well, not more than ~7 billion, I guess).
Artists follow a "long tail" distribution: a few are popular, but most are favorited by a very small number of users.
The cutoff is chosen to select a certain percentage of artists (around 1% with a cutoff of 50 and the given data), so it will increase as the number of users increases.
The third requirement is a little vague - technically, if you have any exact solution you've "evaluated all possible pairs".
Practical Solution
First pass: convert artist names to numeric ids; store the converted favorite data in a temp file; keep a count of user favorites for each artist.
Requires a string->int map to keep track of assigned ids; can use a Patricia tree if space is more important than speed (needed 1/5th the space and twice the time in my, admittedly not very rigorous, tests).
Second pass: iterate over the temp file; throw out artists which didn't, individually, meet the cutoff; keep counts of pairs in a 2d matrix.
Will require n(n-1)/2 bytes (or shorts, or ints, depending on the data size) plus the array reference overhead. Shouldn't be a problem since n is, at most, 0.01-0.05 of 622K.
This seems like it can process any sized real-world dataset using less than 100MB of memory.
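For illustration, a sketch of the second-pass pair-count storage as a flat triangular array, assuming the surviving artists have been renumbered 0..n-1 (the per-user co-occurrence loop is omitted):

```java
// Counts for unordered pairs (i, j), i < j, packed into a flat triangular
// array of n*(n-1)/2 cells, as described above. With n around 6K survivors
// (1% of 622K) that is roughly 19M cells, under 100 MB even with int counters.
public class PairCounts {

    private final int n;
    private final int[] counts;

    public PairCounts(int n) {
        this.n = n;
        this.counts = new int[n * (n - 1) / 2];
    }

    // Maps the unordered pair (i, j) to its slot in the flat array.
    private int index(int i, int j) {
        if (i > j) { int t = i; i = j; j = t; }
        // Cells for rows 0..i-1 come first, then the offset inside row i.
        return i * n - i * (i + 1) / 2 + (j - i - 1);
    }

    public void increment(int i, int j) {
        counts[index(i, j)]++;
    }

    public int get(int i, int j) {
        return counts[index(i, j)];
    }
}
```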
Alternate Solution
If you can't do multiple passes (for whatever contrived reason), use an array of Bloom filters to keep the pair counts: for each pair you encounter, find the highest filter it's (probably) in, and add it to the next highest one. So the first time it's added to bf[0], the second time to bf[1], and so on until bf[49]. Or you can revert to keeping actual counts after a certain point.
I haven't run the numbers, but the lowest few filters will be quite sizable - it's not my favorite solution, but it could work.
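A rough sketch of that cascade, here built on Guava's BloomFilter (an assumption, the original names no library; the sizing parameters are placeholders and would really need to shrink per level):

```java
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Approximate pair counting: a pair that is (probably) present in filter i
// has been seen at least i+1 times. Seeing it again promotes it to filter i+1.
public class BloomCascade {

    private final List<BloomFilter<CharSequence>> levels = new ArrayList<>();

    public BloomCascade(int cutoff, int expectedPairs, double fpp) {
        for (int i = 0; i < cutoff; i++) {
            // Lower levels see far more distinct pairs, so real sizing should
            // use a larger expectedPairs for them; kept uniform here for brevity.
            levels.add(BloomFilter.create(
                    Funnels.stringFunnel(StandardCharsets.UTF_8), expectedPairs, fpp));
        }
    }

    // Record one co-occurrence of a pair (encoded e.g. as "idA:idB" with idA < idB).
    public void add(String pairKey) {
        for (int i = levels.size() - 1; i >= 0; i--) {
            if (levels.get(i).mightContain(pairKey)) {
                if (i + 1 < levels.size()) {
                    levels.get(i + 1).put(pairKey);   // promote to the next level
                }
                // If i is already the top level, the count has saturated at the cutoff.
                return;
            }
        }
        levels.get(0).put(pairKey);                   // first (probable) sighting
    }

    // True if the pair has (probably) been seen at least `cutoff` times.
    public boolean probablyMeetsCutoff(String pairKey) {
        return levels.get(levels.size() - 1).mightContain(pairKey);
    }
}
```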
Any other ideas?
You should consider one of the existing approaches for mining association rules. This is a well-researched problem, and it is unlikely that a home-grown solution would be much faster. Some references:
Wikipedia has a non-terrible list of implementations http://en.wikipedia.org/wiki/Association_rule_learning .
Citing a previous answer of mine: What is the most efficient way to access particular elements in a SortedSet? .
There is a repository of existing implementations here: http://fimi.ua.ac.be/src/ . These are tools that participated in a performance competition a few years back; many of them come with indicative papers to explain how/when/why they are faster than other algorithms.
With two of the requirements being about an inexact solution, I'm guessing they're looking for a fast approximation rather than an exhaustive search. So here's my idea:
Suppose that there is absolutely no correlation between a fan's choices of favorite artists. This is, of course, surely false. Someone who likes Rembrandt is far more likely to also like Rubens than he is to also like Pollock. (You did say we were picking favorite artists, right?) I'll get back to this in a moment.
Then make a pass through the data, counting the number of distinct artists, the number of fans, and how often each artist shows up as a favorite. When this pass is done: (1) Eliminate any artist who doesn't individually show up the required number of "pair times". If an artist only shows up 40 times, he can't possibly be included in more than 40 pairs. (2) For the remaining artists, convert each "like count" to a percentage, i.e. this artist was liked by, say, 10% of the fans. Then for each pair of artists, multiply their like percentages together and multiply by the total number of fans. This is the estimated number of times they'd show up as a pair.
For example, suppose that of 1000 fans, 200 say they like Rembrandt and 100 say they like Michelangelo. That means 20% for Rembrandt and 10% for Michelangelo. So if there's no correlation, we'd estimate that 20% * 10% * 1000 = 20 like both. This is below the threshold, so we wouldn't include this pair.
The catch to this is that there almost surely is a correlation between "likes". My first thought would be to study some real data and see how much of a correlation there is, that is, how the real pair counts differ from the estimates. If we find that, say, the real count is rarely more than twice the estimated count, then we could declare any pair whose estimate is over 1/2 of the threshold a "candidate". Then we do an exhaustive count on the candidates to see how many really meet the condition. This lets us dismiss all the pairs that fall well below the threshold as "unlikely" and thus not worth the cost of investigating.
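A tiny sketch of that estimate and candidate test, using the numbers from the example above:

```java
public class PairEstimate {

    // Estimated co-occurrence count under the independence assumption:
    // P(A) * P(B) * totalFans.
    static double estimatedPairCount(int likesA, int likesB, int totalFans) {
        double pA = (double) likesA / totalFans;
        double pB = (double) likesB / totalFans;
        return pA * pB * totalFans;
    }

    // Candidate if the estimate clears half the threshold; the factor of 2
    // stands in for however much real pair counts turn out to exceed the estimate.
    static boolean isCandidate(int likesA, int likesB, int totalFans, int threshold) {
        return estimatedPairCount(likesA, likesB, totalFans) >= threshold / 2.0;
    }

    public static void main(String[] args) {
        // The Rembrandt/Michelangelo example: 200 and 100 likes out of 1000 fans.
        System.out.println(estimatedPairCount(200, 100, 1000));   // 20.0
        System.out.println(isCandidate(200, 100, 1000, 50));      // false: 20 < 25
    }
}
```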
This could miss pairs when the artists almost always occur together. If, say, 100 fans like Picasso, 60 like Van Gogh, and of the 60 who like Van Gogh 50 also like Picasso, their estimate will be MUCH lower than their actual. If this happens rarely enough it may fall into the acceptable "exact answer not required" category. If it happens all the time this approach won't work.