Efficiently counting co-occurrences in a large dataset - java

Came across this interview programming test recently:
You're given a list of top 50 favorite artists for 1000 users (from last.fm)
Generate a list of all artist pairs that appear together at least 50 times.
The solution can't store all possible pairs in memory, or evaluate all possible pairs.
The solution should be scalable to larger datasets.
The solution doesn't have to be exact, i.e. you can report pairs that have a high probability of meeting the cutoff.
I feel I have a pretty workable solution, but I'm wondering if they were looking for something specific that I missed.
(In case it makes a difference - this isn't from my own interviewing, so I'm not trying to cheat any prospective employers)
Here are my assumptions:
There's a finite maximum number of artists (622K according to MusicBrainz), while there is no limit on the number of users (well, not more than ~7 billion, I guess).
Artists follow a "long tail" distribution: a few are popular, but most are favorited by a very small number of users.
The cutoff is chosen to select a certain percentage of artists (around 1% with 50 and the given data), so it will increase as the number of users increases.
The third requirement is a little vague - technically, if you have any exact solution you've "evaluated all possible pairs".
Practical Solution
first pass: convert artist names to numeric ids; store converted favorite data in a temp file; keep count of user favorites for each artist.
Requires a string->int map to keep track of assigned ids; can use a Patricia tree if space is more important than speed (needed 1/5th the space and twice the time in my, admittedly not very rigorous, tests).
second pass: iterate over the temp file; throw out artists which didn't, individually, meet the cutoff; keep counts of pairs in a 2d matrix.
Will require n(n-1)/2 bytes (or shorts, or ints, depending on the data size) plus the array reference overhead. Shouldn't be a problem since n is, at most, 0.01-0.05 of 622K.
This seems like it can process any sized real-world dataset using less than 100MB of memory.
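
For illustration, here's a minimal Java sketch of the second pass. It assumes the first pass has already written one line of space-separated numeric artist ids per user and built a remap array that maps each artist id to a dense index 0..n-1 (or -1 if the artist fell below the cutoff); the names here are placeholders, not from the original post.

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;

public class PairCounter {
    // Count co-occurrences among the artists that individually met the cutoff.
    // int[] is used here; byte[] or short[] would do if the counts stay small, as noted above.
    static int[][] countPairs(File tempFile, int[] remap, int n) throws IOException {
        // Triangular matrix: pair (i, j) with i < j is stored at counts[j][i].
        int[][] counts = new int[n][];
        for (int j = 0; j < n; j++) counts[j] = new int[j];

        try (BufferedReader in = new BufferedReader(new FileReader(tempFile))) {
            String line;
            while ((line = in.readLine()) != null) {          // one user per line: "id id id ..."
                if (line.isEmpty()) continue;
                String[] tok = line.trim().split("\\s+");
                int[] ids = new int[tok.length];
                int m = 0;
                for (String t : tok) {
                    int idx = remap[Integer.parseInt(t)];
                    if (idx >= 0) ids[m++] = idx;             // drop artists below the cutoff
                }
                for (int a = 0; a < m; a++)
                    for (int b = a + 1; b < m; b++) {
                        int lo = Math.min(ids[a], ids[b]), hi = Math.max(ids[a], ids[b]);
                        counts[hi][lo]++;
                    }
            }
        }
        return counts;
    }
}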
Alternate Solution
If you can't do multiple passes (for whatever contrived reason), use an array of Bloom filters to keep the pair counts: for each pair you encounter, find the highest filter it's (probably) in, and add it to the next one up. So, the first time it's added to bf[0], the second time to bf[1], and so on until bf[49]. Or you can revert to keeping actual counts after a certain point.
I haven't run the numbers, but the lowest few filters will be quite sizable - it's not my favorite solution, but it could work.
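
A rough sketch of that Bloom-filter cascade, here using Guava's BloomFilter; the class name, the pair-key encoding (e.g. "1234:5678") and the sizing/false-positive rate are illustrative assumptions, not something specified above.

import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
import java.nio.charset.StandardCharsets;

public class BloomPairCounter {
    private final BloomFilter<CharSequence>[] levels;

    @SuppressWarnings("unchecked")
    BloomPairCounter(int cutoff, long expectedPairs) {
        levels = new BloomFilter[cutoff];
        for (int i = 0; i < cutoff; i++) {
            levels[i] = BloomFilter.create(
                    Funnels.stringFunnel(StandardCharsets.UTF_8), expectedPairs, 0.01);
        }
    }

    /** Record one sighting of a pair; returns true once it has (probably) reached the cutoff. */
    boolean add(String pairKey) {
        // Find the highest level the pair is (probably) already in, then bump it to the next one.
        int level = 0;
        while (level < levels.length && levels[level].mightContain(pairKey)) level++;
        if (level == levels.length) return true;      // already at or above the cutoff
        levels[level].put(pairKey);
        return level == levels.length - 1;
    }
}

Note that false positives can only over-count, which fits the "report pairs with a high probability of meeting the cutoff" allowance.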
Any other ideas?

You should consider one of the existing approaches for mining association rules. This is a well-researched problem, and it is unlikely that a home-grown solution would be much faster. Some references:
Wikipedia has a non-terrible list of implementations: http://en.wikipedia.org/wiki/Association_rule_learning
Citing a previous answer of mine: What is the most efficient way to access particular elements in a SortedSet?
There is a repository of existing implementations here: http://fimi.ua.ac.be/src/. These are tools that participated in a performance competition a few years back; many of them come with indicative papers to explain how/when/why they are faster than other algorithms.

With two of the requirements being about an inexact solution, I'm guessing they're looking for a fast approximate shortcut instead of an exhaustive search. So here's my idea:
Suppose that there is absolutely no correlation between a fan's choices for favorite artists. This is, of course, surely false. Someone who likes Rembrandt is far more likely to also like Rubens than he is to also like Pollock. (You did say we were picking favorite artists, right?) I'll get back to this in a moment.
Then make a pass through the data, counting the number of distinct artists, the number of fans, and how often each artist shows up as a favorite. When we're done making this pass: (1) Eliminate any artists who don't individually show up the required number of "pair times". If an artist only shows up 40 times, he can't possibly be included in more than 40 pairs. (2) For the remaining artists, convert each "like count" to a percentage, i.e. this artist was liked by, say, 10% of the fans. Then for each pair of artists, multiply their like percentages together and then multiply by the total number of fans. This is the estimated number of times they'd show up as a pair.
For example, suppose that of 1000 fans, 200 say they like Rembrandt and 100 say they like Michelangelo. That means 20% for Rembrandt and 10% for Michelangelo. So if there's no correlation, we'd estimate that 20% * 10% * 1000 = 20 like both. This is below the threshold, so we wouldn't include this pair.
The catch to this is that there almost surely is a correlation between "likes". My first thought would be to study some real data and see how much of a correlation there is, that is, how the real pair counts differ from the estimates. If we find that, say, the real count is rarely more than twice the estimated count, then we could declare any pair whose estimate is over 1/2 of the threshold a "candidate". Then we do an exhaustive count on the candidates to see how many really meet the condition. This would allow us to eliminate all the pairs that fall well below the threshold as "unlikely" and thus not worth the cost of investigating.
This could miss pairs when the artists almost always occur together. If, say, 100 fans like Picasso, 60 like Van Gogh, and of the 60 who like Van Gogh 50 also like Picasso, their estimate will be MUCH lower than their actual count. If this happens rarely enough, it may fall into the acceptable "exact answer not required" category. If it happens all the time, this approach won't work.
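
As a sketch only (the class, method and parameter names, and the factor of 2, are illustrative), the candidate screen described above might look like this:

public class PairEstimate {
    /**
     * likesA/likesB are the individual favorite counts; correlationFactor is the
     * empirical bound discussed above (e.g. 2.0 if real counts are rarely more than
     * twice the estimate). Pairs that pass this screen get an exact count later.
     */
    static boolean isCandidate(long likesA, long likesB, long totalFans,
                               long threshold, double correlationFactor) {
        double pA = (double) likesA / totalFans;       // e.g. 200 / 1000 = 20%
        double pB = (double) likesB / totalFans;       // e.g. 100 / 1000 = 10%
        double estimatedPairs = pA * pB * totalFans;   // expected co-occurrences if independent
        return estimatedPairs * correlationFactor >= threshold;
    }
}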

Related

Best Searching algorithms to find a word from millions of words in Java

Recently, in an interview I was asked which data structure/library I would use to search for a particular word from among multiple millions of words.
I said HashSet, as it executes the search operation, on average, in almost constant time. I also mentioned that we should initialize it with double the known number of elements, so that there will be fewer hash collisions, as the load gets divided among a larger number of buckets and the individual chains are roughly half the size (would that really enhance the performance?). But then the interviewer asked me if that would suffice and was expecting something more.
What else can we do to achieve a higher level of search optimization? Is there any other data structure/library that can help increase the efficiency?
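
For what it's worth, the pre-sizing idea from the question looks like this in Java. To avoid any rehashing, the capacity has to exceed the expected element count divided by the load factor (0.75 by default), so "double the element count" is indeed a safe round number; the figure below is purely illustrative.

import java.util.HashSet;
import java.util.Set;

public class WordLookup {
    public static void main(String[] args) {
        int expectedWords = 5_000_000;   // illustrative figure, not from the question
        // Capacity > expectedWords / 0.75 guarantees the set never resizes while loading.
        Set<String> words = new HashSet<>((int) (expectedWords / 0.75f) + 1);
        words.add("example");
        System.out.println(words.contains("example"));   // average-case O(1) lookup
    }
}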

What is the search-performance difference between several small HashSets and one large HashSet?

I did a POC and found that searching in one large Set of 400 items is 6-7 times faster than searching in 20 sets of 20 items each. Hashing is used in both cases, so how does just looping over the sets cost so much?
Would you expect it to take the same time or 20 times longer? With 20 sets, you need 10.5 lookups on average (assuming the item is present in exactly one of them), so a factor of 10.5 should result. This is reasonably close to your reported factor of 6-7. As you gave us no code, we can't point to where your benchmark fails; but without reading something about how to benchmark, hardly anybody gets it right.
If you want to know more, provide us with more details.
PS: You should hardly ever use 20 sets the way you're probably using them. A Map<Item, Integer> is a much better representation of a set partitioning and is as fast as a Set<Item> (in fact, a Set is implemented via a Map).
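
A sketch of that Map-based partition representation (the names are illustrative): one hash lookup answers both "is it present?" and "which group is it in?".

import java.util.HashMap;
import java.util.Map;

public class PartitionLookup {
    private final Map<String, Integer> groupOf = new HashMap<>();

    void put(String item, int groupIndex) {
        groupOf.put(item, groupIndex);
    }

    /** Returns the group index, or -1 if the item is in none of the groups. */
    int find(String item) {
        return groupOf.getOrDefault(item, -1);
    }
}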

Is it possible to find every possible combination of a tie for the electoral college in a program in a relatively short amount of time?

My computer science teacher has assigned this problem to us, and just about everyone in our class was in an uproar over its complexity. We are only in Advanced Topics of Computer Science in high school, and none of us really has any idea where to start or what algorithms to use. We have determined that going straight through every possible combination would mean 2^50 combinations to run through, which is way, WAY too big for any of us to search. I'm just curious whether this is even possible at our low computer science skill level, and whether anyone personally thinks this is a fair problem, because our teacher still hasn't found a solution to his own problem.
Thanks!
The solution space is not really 2^50. A tie (assuming only two candidates) means 269-269. You can't get to 269 with only one state (or even only a handful of states) so you can immediately throw out all small subsets and all large subsets (winning every state also doesn't work). Furthermore, you only need to look for subsets that total 269 (because there are 538 total, that means that the complement of each of those sets is also 269).
That said, this still boils down to the subset sum problem: (https://en.wikipedia.org/wiki/Subset_sum_problem) so any solution will not scale well (unless you figure out how to do it in polynomial time, in which case you can claim $1,000,000). However, your problem is not to scale it; for the case of the US electoral college configuration (including vote splits in some states) it is not too large to figure out in a reasonable (< 10 mins as you say) amount of time.
The solution space is smaller than it seems, since some states have the same number of electoral votes. For example, Florida and New York both have 29 electoral votes, so there are really just three cases, not four: both on the left, both on the right, and one on each side (which should be double-counted since this can happen in two ways). This reduces the number of cases to 6.2 * 10^9, over five orders of magnitude smaller than 2^51 (although, in exchange, there's a slight amount of extra work determining how many cases you're representing). Even without further optimization this is small enough to iterate over fairly quickly.
This PARI/GP script
EV=[55,38,29,20,18,16,15,14,13,12,11,10,9,8,7,6,5,4,3]~;
count=[1,1,2,2,1,2,1,1,1,1,4,4,3,2,3,6,3,5,8];
s=0; forvec(v=vector(#count,i,[0,count[i]]), if(v*EV==269, s+=prod(i=1,#count, binomial(count[i],v[i])))); s
yields an answer within milliseconds.
This version doesn't attempt to handle third-party candidates, split state votes, etc.
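
If you'd rather stay in Java than PARI/GP, a weighted subset-sum DP over the same grouped electoral-vote table gives the identical count (again ignoring third-party candidates and split state votes). This is a sketch, with class and method names chosen for illustration: picking k states from a group of c states with the same vote value contributes C(c, k) ways.

import java.math.BigInteger;
import java.util.Arrays;

public class ElectoralTies {
    static final int[] EV    = {55, 38, 29, 20, 18, 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3};
    static final int[] COUNT = { 1,  1,  2,  2,  1,  2,  1,  1,  1,  1,  4,  4, 3, 2, 3, 6, 3, 5, 8};

    public static void main(String[] args) {
        int target = 269;
        BigInteger[] ways = new BigInteger[target + 1];
        Arrays.fill(ways, BigInteger.ZERO);
        ways[0] = BigInteger.ONE;

        for (int g = 0; g < EV.length; g++) {             // one DP step per group of equal-EV states
            BigInteger[] next = new BigInteger[target + 1];
            Arrays.fill(next, BigInteger.ZERO);
            for (int s = 0; s <= target; s++) {
                if (ways[s].signum() == 0) continue;
                for (int k = 0; k <= COUNT[g] && s + k * EV[g] <= target; k++) {
                    next[s + k * EV[g]] = next[s + k * EV[g]]
                            .add(ways[s].multiply(binomial(COUNT[g], k)));
                }
            }
            ways = next;
        }
        System.out.println(ways[target]);   // number of 269-269 splits
    }

    static BigInteger binomial(int n, int k) {
        BigInteger r = BigInteger.ONE;
        for (int i = 1; i <= k; i++) {
            r = r.multiply(BigInteger.valueOf(n - i + 1)).divide(BigInteger.valueOf(i));
        }
        return r;
    }
}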

BTrees and Disk Persistance

For some time I have been working on creating an index for very large data sets (around 190 million records). I have a BTree which can insert data sets (typically an object) and search for a key, and while searching for how to persist the data into files on disk, I came across this amazing article (http://www.javaworld.com/article/2076333/java-web-development/use-a-randomaccessfile-to-build-a-low-level-database.html#resources). This pretty much gives me the starting point.
Here they are indexing a String key to a binary object (blob). Their file format is divided into 3 regions: a header (stores the start point of the indexes), an index region (stores each index and its corresponding location) and a data region (stores the data). They use RandomAccessFile to get the data.
How do I define a similar file format for a BTree? All I know is that for every read made to disk, I have to get one node (typically one block, 512 bytes). There are many similar questions on how to persist, but it is a little difficult to understand the big picture of why we decide on a particular implementation, as in this question (Persisting B-Tree nodes to RandomAccessFile - [SOLVED]). Please share your thoughts.
Here is an alternative take on the question, based on problem specifics that have become known in the meantime. This post is based on the following assumptions:
record count about 190 million, fixed
keys are 64-byte hashes, like SHA-256
values are filenames: variable length, but sensible (average length < 64 bytes, max < page)
page size 4 KiByte
Efficient representation of filenames in a database is a different topic that cannot be addressed here. Should the filenames be awkward - longish on average and/or Unicode - then the hashing solution will punish you with increased disk read counts (more overflows, more chaining) or reduced average occupancy (more wasted space). A B-tree solution reacts somewhat more benignly, though, since an optimum tree can be constructed in any case.
The most efficient solution in this situation - and the simplest to implement by a wide margin - is hashing, since your keys are perfect hashes already. Take the first 23 bits of the hash as the page number, and lay out the pages like this:
page header
uint32_t next_page
uint16_t key_count
key/offset vector
uint16_t value_offset;
byte key[64];
... unallocated space ...
last arrived filename
...
2nd arrived filename
1st arrived filename
Values (filenames) are stored from the end of the page downwards, prefixed with their 16-bit length, and the key/offset vector grows upwards. That way neither low/high key counts nor short/long values can cause unnecessary waste of space, as would be the case with fixed-size structures. Nor do you have to parse variable-length structures during key searches. Apart from that I've aimed for the greatest possible simplicity - no premature optimisation. The bottom of the heap can be stored in the page header, in KO[PH.key_count].value_offset (my preference), or computed as KO.Take(PH.key_count).Select(r => r.value_offset).Min(), whatever pleases you most.
The key/offset vector needs to be kept sorted on the keys so that you can use binary search but the values can be written as they arrive, they do not need to be in any particular order. If the page overflows, allocate a new one just like it at the current end of the file (growing the file by one page) and stash its page number in the appropriate header slot. This means that you can binary search within a page but all chained pages need to be read and searched one by one. Also, you do not need any kind of file header, since the file size is otherwise available and that's the only piece of global management information that needs to be maintained.
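
To make the layout concrete, here is a hedged ByteBuffer sketch of a lookup within one such page. It assumes the header is 6 bytes (uint32 next_page, uint16 key_count), each vector slot is a uint16 value_offset followed by the 64-byte key, and value_offset points at the 16-bit length prefix of the filename - all of which is my reading of the layout above, not a specification.

import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class HashPage {
    static final int HEADER = 6;          // uint32 next_page + uint16 key_count
    static final int KEY_SIZE = 64;       // SHA-256-style hash key
    static final int SLOT = 2 + KEY_SIZE; // uint16 value_offset + key

    /** Binary-search the sorted key/offset vector; returns the filename, or null if absent. */
    static String lookup(ByteBuffer page, byte[] key) {
        int keyCount = page.getShort(4) & 0xFFFF;
        int lo = 0, hi = keyCount - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            int slotPos = HEADER + mid * SLOT;
            int cmp = Arrays.compareUnsigned(bytesAt(page, slotPos + 2, KEY_SIZE), key);
            if (cmp == 0) {
                int valueOffset = page.getShort(slotPos) & 0xFFFF;
                int len = page.getShort(valueOffset) & 0xFFFF;        // 16-bit length prefix
                return new String(bytesAt(page, valueOffset + 2, len), StandardCharsets.UTF_8);
            }
            if (cmp < 0) lo = mid + 1; else hi = mid - 1;
        }
        return null;  // caller reads next_page (uint32 at offset 0) and follows the chain if non-zero
    }

    static byte[] bytesAt(ByteBuffer page, int pos, int len) {
        byte[] out = new byte[len];
        for (int i = 0; i < len; i++) out[i] = page.get(pos + i);
        return out;
    }
}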
Create the file as a sparse file with the number of pages as indicated by your chosen number of hash key bits (e.g. 8388608 pages for 23 bits). Empty pages in a sparse file don't take up any disk space and read as all 0s, which works perfectly fine with our page layout/semantics. Extend the file by one page whenever you need to allocate an overflow page. Note: the 'sparse file' thing isn't very important here since almost all pages will have been written to when you're done building the file.
For maximum efficiency you need to run some analyses on your data. In my simulation - with random numbers as stand-ins for the hashes, and on the assumption that average filename size is 62 bytes or less - the optimum turned out to be making 2^23 = 8388608 buckets/pages. This means that you take the first 23 bit of the hash as the page number to load. Here are the details:
# bucket statistics for K = 23 and N = 190000000 ... 7336.5 ms
average occupancy 22.6 records
0 empty buckets (min: 3 records)
310101/8388608 buckets with 32+ records (3.7%)
That keeps the chaining to a minimum, on average you need to read just 1.04 pages per search. Increasing the hash key size by one single bit to 24 reduces the expected number of overflowing pages to 3 but doubles the file size and reduces average occupancy to 11.3 records per page/bucket. Reducing the key to 22 bits means that almost all pages (98.4%) can be expected to overflow - meaning the file is virtually the same size as that for 23 bits but you have to do twice as many disk reads per search.
Hence you see how important it is to run a detailed analysis on the data to decide on the proper number of bits to use for hash addressing. You should run an analysis that uses the actual filename sizes and tracks the per-page overhead, to see what the actual picture looks like for 22 bits to 24 bits. It'll take a while to run but that's still way faster than building a multi-gigabyte file blindly and then finding that you have wasted 70% of space or that searches take significantly more than 1.05 page reads on average.
Any B-tree based solution would be much more involved (read: complicated) but could not reduce the page read count per search below 1.000, for obvious reasons, and even that only on the assumption that a sufficient number of internal nodes can be kept cached in memory. If your system has such humongous amounts of RAM that data pages can be cached to a significant degree then the hashing solution will benefit just as much as one that is based on some kind of B-tree.
As much as I would like an excuse for building a screamingly fast hybrid radix/B+tree, the hashing solution delivers essentially the same performance for a tiny fraction of the effort. The only thing where B-treeish solutions can outdo hashing here is space efficiency, since it is trivial to construct an optimum tree for existing pre-sorted data.
There are plenty of Open Source key/value stores and full database engines - take a week off and start Googling. Even if you end up using none of them, you still need to study a representative cross section (architecture, design histories, key implementation details) to get enough of an overview of the subject matter so that you can make informed decisions and ask intelligent questions. For a brief overview, try to Google details on index file formats, both historic ones like IDX or NTX, and current ones used in various database engines.
If you want to roll your own then you might consider hitching yourself to the bandwagon of an existing format, like the dBASE variants Clipper and Visual FoxPro (my favourite). This gives you the ability to work your data with existing tools, including Total Commander plugins and whatnot. You don't need to support the full formats, just the single binary instance of the format that you choose for your project. Great for debugging, reindexing, ad hoc queries and so on. The format itself is dead simple and easy to generate even if you don't use any of the existing libraries. The index file formats aren't quite as trivial but still manageable.
If you want to roll your own from scratch then you've got quite a road ahead of you, since the basics of intra-node (intra-page) design and practice are poorly represented on the Internet and in literature. For example, some old DDJ issues contained articles about efficient key matching in connection with prefix truncation (a.k.a. 'prefix compression') and so on but I found nothing comparable out there on the 'net at the moment, except buried deeply in some research papers or source code repositories.
The single most important item here is the algorithm for searching prefix-truncated keys efficiently. Once you've got that, the rest more or less falls into place. I have found only one resource on the 'net, which is this DDJ (Dr Dobb's Journal) article:
Supercharging Sequential Searches by Walter Williams
A lot of tricks can also be gleaned from papers like
Efficient index compression in DB2 LUW
For more details and pretty much everything else you could do a lot worse than reading the following two books cover to cover (both of them!):
Goetz Graefe: Modern B-Tree Techniques (ISBN 1601984820)
Jim Gray: Transaction Processing. Concepts and Techniques (ISBN 1558601902)
An alternative to the latter might be
Philip E. Bernstein: Principles of Transaction Processing (ISBN 1558606238)
It covers a similar spectrum and it seems to be a bit more hands-on, but it does not seem to have quite the same depth. I cannot say for certain, though (I've ordered a copy but haven't got it yet).
These books give you a complete overview over all that's involved, and they are virtually free of fat - i.e. you need to know almost everything that's in there. They will answer gazillions of questions that you didn't know you had, or that you should have asked yourself. And they cover the whole ground - from B-tree (and B+tree) basics to detailed implementation issues like concurrency, locking, page replacement strategies and so forth. And they enable you to utilise the information that is scattered over the 'net, like articles, papers, implementation notes and source code.
Having said that, I'd recommend matching the node size to the architecture's RAM page size (4 KB or 8 KB), because then you can utilise the paging infrastructure of your OS instead of running afoul of it. And you're probably better off keeping index and blob data in separate files. Otherwise you couldn't put them on different volumes, and the data would b0rk the caching of the index pages in subsystems that are not part of your program (hardware, OS and so forth).
I'd definitely go with a B+tree structure instead of watering down the index pages with data as in a normal B-tree. I'd also recommend using an indirection vector (Graefe has some interesting details there) in connection with length-prefixed keys. Treat the keys as raw bytes and keep all the collation/normalisation/upper-lower nonsense out of your core engine. Users can feed you UTF8 if they want - you don't want to have to care about that, trust me.
There is something to be said for using only suffix truncation in internal nodes (i.e. for distinguishing between 'John Smith' and 'Lucky Luke', 'K' or 'L' work just as well as the given keys) and only prefix truncation in leaves (i.e. instead of 'John Smith' and 'John Smythe' you store 'John Smith' and 7+'ythe').
It simplifies the implementation, and gives you most of the bang that could be got. I.e. shared prefixes tend to be very common at the leaf level (between neighbouring records in index order) but not so much in internal nodes, i.e. at higher index levels. Conversely, the leaves need to store the full keys anyway and so there's nothing to truncate and throw away there, but internal nodes only need to route traffic and you can fit a lot more truncated keys in a page than non-truncated ones.
Key matching against a page full of prefix-truncated keys is extremely efficient - on average you compare a lot less than one character per key - but it's still a linear scan, even with all the hopping forward based on skip counts. This limits effective page sizes somewhat, since binary search is more complicated in the face of truncated keys. Graefe has a lot of details on that. One workaround for enabling bigger node sizes (many thousands of keys instead of hundreds) is to lay out the node like a mini B-tree with two or three levels. It can make things lightning-fast (especially if you respect magic thresholds like 64-byte cache line size), but it also makes the code hugely more complicated.
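
To give a flavour of what searching prefix-truncated keys looks like, here is a simplified, hypothetical sketch (not the DDJ algorithm verbatim). Each entry stores how many leading bytes it shares with the previous key plus its remaining suffix, and the scan tracks how many bytes of the search key are known to match the previously visited key, so most entries are skipped without touching their bytes at all.

import java.util.Arrays;

public class PrefixTruncatedSearch {

    /** One stored entry: the full key is (shared bytes of the previous key) + suffix. */
    static final class Entry {
        final int shared; final byte[] suffix;
        Entry(int shared, byte[] suffix) { this.shared = shared; this.suffix = suffix; }
    }

    /** Encode keys (sorted ascending in unsigned byte order) into prefix-truncated entries. */
    static Entry[] encode(byte[][] sortedKeys) {
        Entry[] out = new Entry[sortedKeys.length];
        byte[] prev = new byte[0];
        for (int i = 0; i < sortedKeys.length; i++) {
            byte[] k = sortedKeys[i];
            int shared = 0;
            while (shared < prev.length && shared < k.length && prev[shared] == k[shared]) shared++;
            out[i] = new Entry(shared, Arrays.copyOfRange(k, shared, k.length));
            prev = k;
        }
        return out;
    }

    /** Linear scan; 'matched' = bytes of target known to equal the previously visited key. */
    static int search(Entry[] entries, byte[] target) {
        int matched = 0;
        for (int i = 0; i < entries.length; i++) {
            Entry e = entries[i];
            if (e.shared > matched) continue;   // differs from target where the previous key did: entry < target
            if (e.shared < matched) return -1;  // entry > target, and keys are sorted: target is absent
            // e.shared == matched: only now compare actual bytes, starting at position 'matched'
            byte[] s = e.suffix;
            int j = 0;
            while (j < s.length && matched + j < target.length && s[j] == target[matched + j]) j++;
            if (j == s.length && matched + j == target.length) return i;                 // exact match
            if (matched + j == target.length) return -1;                                 // target is a prefix of entry
            if (j < s.length && (s[j] & 0xFF) > (target[matched + j] & 0xFF)) return -1; // entry > target
            matched += j;                        // entry < target: remember how far we matched
        }
        return -1;
    }
}

Binary search needs more care than this linear scan, which is exactly the complication mentioned above; the skip counts are what make the sequential version cheap.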
I'd go with a simple lean and mean design (similar in scope to IDA's key/value store), or use an existing product/library, unless you are in search of a new hobby...

Are BitSet operations really faster than SortedSet operations?

I am looking around for the best algorithms for bitset operations like intersection and union, and have found a lot of links and similar questions.
Eg: Similar Question on Stack-Overflow
One thing, however, that I am trying to understand is where bit sets stand in this. E.g., Lucene uses BitSet operations to provide high-performing set operations, especially because it can work at a lower level.
However, it looks to me like a bit set will perform slower and slower as the domain grows while the set stays sparse, say a set with ~10 elements where the maximum number of elements can be 2 billion, because that will call for a lot of unnecessary matching. What do you suggest?
Bit Sets indeed make sense for dense sets, i.e. covering a significant fraction of the domain, as they represent every possible element. The space and running time requirements are O(D) [D = domain size = 2 billion !].
Sorted Set operations represent only the elements in the given set and will have an O(E) behavior [E = number of elements = 10], much more appropriate.
Bit Sets are fast, not efficient: their hidden constant is smaller, but the cost is O(D) regardless of how few elements are set. They are blazingly fast for small domains (say D <= 1024), as they can process 32/64 elements in a single CPU instruction.
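
A small sketch of the contrast, using java.util.BitSet for the dense case and sorted int arrays standing in for a sorted-set representation (the class name is illustrative):

import java.util.Arrays;
import java.util.BitSet;

public class SetIntersection {

    static BitSet denseIntersect(BitSet a, BitSet b) {
        BitSet result = (BitSet) a.clone();
        result.and(b);   // walks the backing long[] arrays, sized by the highest set bit (up to ~D/64 words)
        return result;
    }

    static int[] sparseIntersect(int[] a, int[] b) {   // both sorted ascending: O(E1 + E2) merge
        int[] out = new int[Math.min(a.length, b.length)];
        int i = 0, j = 0, k = 0;
        while (i < a.length && j < b.length) {
            if (a[i] < b[j]) i++;
            else if (a[i] > b[j]) j++;
            else { out[k++] = a[i]; i++; j++; }
        }
        return Arrays.copyOf(out, k);
    }
}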
For sparse bitsets you can greatly improve performance (and reduce memory usage) using sparse bitmaps where you divide your data into chunks as opposed to storing everything under a single key.
When using bitmaps for analytics, you have a limited number of users active at any given time (e.g. day) and sparse bitmaps use this fact to their advantage.
Shameless plug: http://github.com/bilus/redis-bitops (if you're using Ruby but there are also performance notes there).
