How to multiply two big big numbers

How to multiply two big big numbers - java

You are given a list of n numbers L=<a_1, a_2,...a_n>. Each of them is
either 0 or of the form +/- 2^k, 0 <= k <= 30. Describe and implement an
algorithm that returns the largest product of a CONTINUOUS SUBLIST
p=a_i*a_(i+1)*...*a_j, 1 <= i <= j <= n.
For example, for the input <8 0 -4 -2 0 1> it should return 8 (either 8
or (-4)*(-2)).
You can use any standard programming language and can assume that
the list is given in any standard data structure, e.g. int[],
vector<int>, List<Integer>, etc.
What is the computational complexity of your algorithm?

In my first answer I addressed the OP's problem in "multiplying two big big numbers". As it turns out, this wish is only a small part of a much bigger problem which I'm going to address now:
"I still haven't arrived at the final skeleton of my algorithm I wonder if you could help me with this."
(See the question for the problem description)
All I'm going to do is explain the approach Amnon proposed in a little more detail, so all the credit should go to him.
You have to find the largest product of a continuous sublist from a list of integers which are powers of 2. The idea is to:
Compute the product of every continuous sublist.
Return the biggest of all these products.
You can represent a sublist by its start and end index. For start=0 there are n possible values for end, namely 0..n-1. This generates all sublists that start at index 0. In the next iteration, you increment start by 1 and repeat the process (this time, there are n-1 possible values for end). This way you generate all possible sublists.
Now, for each of these sublists, you have to compute the product of its elements, that is, come up with a method computeProduct(List wholeList, int startIndex, int endIndex). You can either use the built-in BigInteger class (which should be able to handle the input provided by your assignment) to save you from further trouble, or try to implement a more efficient way of multiplication as described by others. (I would start with the simpler approach, since it's easier to see whether your algorithm works correctly, and only then try to optimize it.)
Now that you're able to iterate over all sublists and compute the product of their elements, determining the sublist with the maximum product should be the easiest part.
If it's still too hard for you to make the connection between the two steps, let us know, but please also provide us with a draft of your code as you work on the problem, so that we don't end up incrementally constructing the solution while you copy and paste it.
edit: Algorithm skeleton
public BigInteger listingSublist(BigInteger[] biArray)
{
    int start = 0;
    int end = biArray.length - 1;
    BigInteger maximum = null; // stays null only for an empty array
    for (int i = start; i <= end; i++)
    {
        for (int j = i; j <= end; j++)
        {
            // keep the largest product seen so far
            BigInteger product = computeProduct(biArray, i, j);
            if (maximum == null || product.compareTo(maximum) > 0)
            {
                maximum = product;
            }
        }
    }
    return maximum;
}
public BigInteger computeProduct(BigInteger[] wholeList, int startIndex,
        int endIndex)
{
    // returns wholeList[startIndex] * wholeList[startIndex+1] * ... * wholeList[endIndex]
    BigInteger product = BigInteger.ONE;
    for (int k = startIndex; k <= endIndex; k++)
    {
        product = product.multiply(wholeList[k]);
    }
    return product;
}

Since k <= 30, any integer i = 2^k will fit into a Java int. However, the product of two such integers might not fit into a Java int, since 2^k * 2^k = 2^(2*k) <= 2^60, which does fit into a Java long. This should answer your question regarding the "(multiplication of) two numbers...".
In case you want to multiply more than two numbers, which is implied by your assignment saying "...largest product of a CONTINUOUS SUBLIST..." (a sublist's length could be > 2), have a look at Java's BigInteger class.
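A minimal sketch of where the overflow happens, assuming the factors are stored as Java ints. The class and method names are made up for illustration.

```java
import java.math.BigInteger;

final class PowerOfTwoProducts {
    // Widen to long BEFORE multiplying; int * int would silently wrap around.
    static long productOfTwo(int a, int b) {
        return (long) a * b;
    }

    // With three or more factors even a long can overflow, so use BigInteger.
    static BigInteger productOfMany(int... factors) {
        BigInteger product = BigInteger.ONE;
        for (int f : factors) {
            product = product.multiply(BigInteger.valueOf(f));
        }
        return product;
    }
}
```

For example, productOfTwo(1 << 30, 1 << 30) yields 2^60, while three such factors already need productOfMany.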

Actually, the most efficient way of multiplication is doing addition instead. In this special case all you have is numbers that are powers of two, and you can get the product of a sublist by simply adding the exponents together (and counting the negative numbers in your product, and making the result negative if that count is odd).
Of course, to store the result you may need BigInteger if you run out of bits. Or, depending on what the output should look like, just say (+/-)2^N, where N is the sum of the exponents.
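The exponent-adding idea could look roughly like this. It is only a sketch: the class and method names are hypothetical, and it assumes every entry really is 0 or +/- 2^k as the assignment states.

```java
import java.math.BigInteger;

final class ExponentProduct {
    /** Product of list[from..to] (both inclusive); every entry must be 0 or +/- 2^k. */
    static BigInteger product(int[] list, int from, int to) {
        int exponentSum = 0;
        boolean negative = false;
        for (int i = from; i <= to; i++) {
            if (list[i] == 0) {
                return BigInteger.ZERO;
            }
            // For a power of two, the exponent equals the number of trailing zero bits.
            exponentSum += Integer.numberOfTrailingZeros(Math.abs(list[i]));
            if (list[i] < 0) {
                negative = !negative; // an odd number of negatives flips the sign
            }
        }
        BigInteger magnitude = BigInteger.ONE.shiftLeft(exponentSum); // 2^exponentSum
        return negative ? magnitude.negate() : magnitude;
    }
}
```

Only the final conversion touches BigInteger; all intermediate arithmetic is plain int addition.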
Parsing the input could be a matter of switch-case, since you only have 31 distinct powers (2^0 through 2^30) to take care of. Plus the negatives.
That's the boring part. The interesting part is how you get the sublist that produces the largest number. You can take the dumb approach, by checking every single variation, but that would be an O(N^2) algorithm in the worst case (IIRC). Which is really not very good for longer inputs.
What can you do? I'd probably start from the largest non-negative number in the list as a sublist, and grow the sublist to get as many non-negative numbers in each direction as I can. Then, having all the positives in reach, proceed with pairs of negatives on both sides, eg. only grow if you can grow on both sides of the list. If you cannot grow in both directions, try one direction with two (four, six, etc. so even) consecutive negative numbers. If you cannot grow even in this way, stop.
Well, I don't know if this algorithm even works, but if it (or something similar) does, it's an O(N) algorithm, which means great performance. Let's try it out! :-)

Hmmm.. since they're all powers of 2, you can just add the exponent instead of multiplying the numbers (equivalent to taking the logarithm of the product). For example, 2^3 * 2^7 is 2^(7+3)=2^10.
I'll leave handling the sign as an exercise to the reader.
Regarding the sublist problem, there are fewer than n^2 pairs of (begin,end) indices. You can check them all, or try a dynamic programming solution.

EDIT: I adjusted the algorithm outline to match the actual pseudo code and put the complexity analysis directly into the answer:
Outline of algorithm
Go sequentially over the sequence and store the value and first/last index of the product (positive) since the last 0. Do the same for another product (negative) which only consists of the numbers since the first sign change of the sequence. If you hit a negative sequence element, swap the two products (positive and negative) along with the associated starting indices. Whenever the positive product hits a new maximum, store it and the associated start and end indices. After going over the whole sequence, the result is stored in the maximum variables.
To avoid overflow calculate in binary logarithms and an additional sign.
Pseudo code
maxProduct = 0
maxProductStartIndex = -1
maxProductEndIndex = -1
sequence.push_front( 0 ) // reuses variable initialization of the case n == 0
for every index of sequence
    n = sequence[index]
    if n == 0
        posProduct = 0
        negProduct = 0
        posProductStartIndex = index+1
        negProductStartIndex = -1
    else
        if n < 0
            swap( posProduct, negProduct )
            swap( posProductStartIndex, negProductStartIndex )
            if -1 == posProductStartIndex // start second sequence on sign change
                posProductStartIndex = index
            end if
            n = -n;
        end if
        logN = log2(n) // as indicated all arithmetic is done on the logarithms
        posProduct += logN
        if -1 < negProductStartIndex // start the second product as soon as the sign changes first
            negProduct += logN
        end if
        if maxProduct < posProduct // update current best solution
            maxProduct = posProduct
            maxProductStartIndex = posProductStartIndex
            maxProductEndIndex = index
        end if
    end if
end for
// output solution
print "The maximum product is " 2^maxProduct "."
print "It is reached by multiplying the numbers from sequence index "
print maxProductStartIndex " to sequence index " maxProductEndIndex
Complexity
The algorithm uses a single loop over the sequence, so it's O(n) times the complexity of the loop body. The most complicated operation of the body is log2. Ergo it's O(n) times the complexity of log2. The log2 of a number of bounded size is O(1), so the resulting complexity is O(n), aka linear.
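For comparison, here is a single-pass Java sketch built on the same idea (one running product since the last 0, and a second one that starts after the first negative number). It is not a line-by-line translation of the pseudocode: it does not track the start and end indices, all names are hypothetical, and it assumes a non-empty input of values that are 0 or +/- 2^k.

```java
import java.math.BigInteger;

final class LinearMaxProduct {
    static BigInteger maxProduct(int[] a) {
        if (a.length == 0) {
            throw new IllegalArgumentException("empty list");
        }
        int best = -1;               // log2 of the best positive product, -1 = none found
        boolean sawZero = false;
        int total = 0;               // exponent sum since the last 0
        int tail = 0;                // exponent sum after the first negative since the last 0
        boolean negative = false;    // sign of the product since the last 0
        boolean sawNegative = false;
        boolean tailNonEmpty = false;
        for (int x : a) {
            if (x == 0) {            // a zero cuts the list: reset all running state
                sawZero = true;
                total = tail = 0;
                negative = sawNegative = tailNonEmpty = false;
                continue;
            }
            int e = Integer.numberOfTrailingZeros(Math.abs(x)); // log2(|x|)
            total += e;
            if (sawNegative) {       // the tail excludes the first negative itself
                tail += e;
                tailNonEmpty = true;
            }
            if (x < 0) {
                negative = !negative;
                sawNegative = true;
            }
            if (!negative) {
                best = Math.max(best, total);
            } else if (tailNonEmpty) {
                best = Math.max(best, tail); // dropping the prefix through the first negative
            }
        }
        if (best >= 0) {
            return BigInteger.ONE.shiftLeft(best);
        }
        // No positive product exists: either a zero is the best, or the list is one negative number.
        return sawZero ? BigInteger.ZERO : BigInteger.valueOf(a[0]);
    }
}
```

Because every non-zero value has magnitude at least 1, the longest run with a positive sign is always at least as good as any shorter one, which is why two running sums are enough.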

I'd like to combine Amnon's observation about multiplying powers of 2 with one of mine concerning sublists.
Lists are terminated hard by 0's. We can break the problem down into finding the biggest product in each sub-list, and then the maximum of that. (Others have mentioned this).
This is my 3rd revision of this writeup. But 3's the charm...
Approach
Given a list of non-0 numbers, (this is what took a lot of thinking) there are 3 sub-cases:
1. The list contains an even number of negative numbers (possibly 0). This is the trivial case, the optimum result is the product of all numbers, guaranteed to be positive.
2. The list contains an odd number of negative numbers, so the product of all numbers would be negative. To change the sign, it becomes necessary to sacrifice a subsequence containing a negative number. Two sub-cases:
a. sacrifice numbers from the left up to and including the leftmost negative; or
b. sacrifice numbers from the right up to and including the rightmost negative.
In either case, return the product of the remaining numbers. Having sacrificed exactly one negative number, the result is certain to be positive. Pick the winner of (a) and (b).
Implementation
The input needs to be split into subsequences delimited by 0. The list can be processed in place if a driver method is built to loop through it and pick out the beginnings and ends of non-0 sequences.
Doing the math in longs would only double the possible range. Converting to log2 makes arithmetic with large products easier. It prevents program failure on large sequences of large numbers. It would alternatively be possible to do all math in Bignums, but that would probably perform poorly.
Finally, the end result, still a log2 number, needs to be converted into printable form. Bignum comes in handy there. There's new BigInteger("2").pow(log); which will raise 2 to the power of log.
Complexity
This algorithm works sequentially through the sub-lists, only processing each one once. Within each sub-list, there's the annoying work of converting the input to log2 and the result back, but the effort is linear in the size of the list. In the worst case, the sum of much of the list is computed twice, but that's also linear complexity.
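The per-segment step might be sketched as follows. The names are hypothetical, the driver that splits the input at zeros is left out, and the method works on log2 values as described above. It returns -1 when nothing remains after the sacrifice (for example, a segment that is a single negative number), a case the caller has to handle separately.

```java
final class SegmentProduct {
    private static int exponent(int x) {
        return Integer.numberOfTrailingZeros(Math.abs(x)); // log2(|x|) for x = +/- 2^k
    }

    /**
     * Largest positive product (as a log2 value) of a contiguous run inside
     * a[from..to), which must contain no zeros. Returns -1 if no positive product exists.
     */
    static int bestExponent(int[] a, int from, int to) {
        int total = 0, negatives = 0, firstNeg = -1, lastNeg = -1;
        for (int i = from; i < to; i++) {
            total += exponent(a[i]);
            if (a[i] < 0) {
                negatives++;
                if (firstNeg < 0) firstNeg = i;
                lastNeg = i;
            }
        }
        if (negatives % 2 == 0) {
            return to > from ? total : -1;      // trivial case: take everything
        }
        int leftCost = 0;                        // (a) sacrifice from the left through firstNeg
        for (int i = from; i <= firstNeg; i++) leftCost += exponent(a[i]);
        int rightCost = 0;                       // (b) sacrifice from lastNeg through the right end
        for (int i = lastNeg; i < to; i++) rightCost += exponent(a[i]);
        int best = -1;
        if (firstNeg + 1 < to) best = total - leftCost;               // something remains on the right
        if (lastNeg > from) best = Math.max(best, total - rightCost); // something remains on the left
        return best;
    }
}
```

The final answer would then be new BigInteger("2").pow(best) for the largest value over all segments.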

See this code. Here I implement exact factorial of a huge large number. I am just using integer array to make big numbers. Download the code from Planet Source Code.

Related

Subset sum problem with continuous subset using recursion

I am trying to think how to solve the Subset sum problem with an extra constraint: The subset of the array needs to be continuous (the indexes need to be consecutive). I am trying to solve it using recursion in Java.
I know the solution for the non-constrained problem: Each element can be in the subset (and thus I perform a recursive call with sum = sum - arr[index]) or not be in it (and thus I perform a recursive call with sum = sum).
I am thinking about maybe adding another parameter for knowing whether or not the previous index is part of the subset, but I don't know what to do next.

You are on the right track.
Think of it this way:
for every entry you have to decide: do you want to start a new sum at this point or skip it and reconsider the next entry.
a + b + c + d contains the sum of b + c + d. Do you want to recompute the sums?
Maybe a bottom-up approach would be better

The O(n) solution that you asked for:
This solution requires three integer variables: the start and end indices, and the total sum of the span.
Starting from element 0 (or from the end of the list if you want), increase the end index until the total sum is greater than or equal to the desired value. If it is equal, you've found a subset sum. If it is greater, move the start index up by one and subtract the value at the previous start index. If the resulting total is still greater than the desired value, move the end index back until the sum is less than the desired value; if it is less, move the end index forward until the sum is greater than or equal to the desired value. If no match is found, repeat.
So, caveats:
Is this "fairly obvious"? Maybe, maybe not. I was making assumptions about order of magnitude similarity when I said both "fairly obvious" and O(n) in my comments.
Is this actually O(n)? It depends a lot on how similar (in terms of order of magnitude, i.e. digits in the number) the numbers in the list are. The closer all the numbers are to each other, the fewer steps you'll need to make on the end index to test if a subset exists. On the other hand, if you have a couple of very big numbers (like in the thousands) surrounded by hundreds of pretty small numbers (1's and 2's and 3's), the solution I've presented will get closer to O(n^2).
This solution only works based on your restriction that the subset values are continuous
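A common way to write this kind of two-index scan is the sliding window below. This is a swapped-in variant rather than a transcription of the steps above: the end index only ever moves forward, and it additionally assumes that all values are non-negative (with negative values the window sum is no longer monotonic and the scan can miss solutions). The names are made up for illustration.

```java
final class ContiguousSubsetSum {
    /**
     * Returns {start, endExclusive} of a non-empty contiguous run summing to target,
     * or null if there is none. Assumes every value is >= 0.
     */
    static int[] find(int[] values, long target) {
        long sum = 0;
        int start = 0;
        for (int end = 0; end < values.length; end++) {
            sum += values[end];                       // grow the window to the right
            while (sum > target && start <= end) {
                sum -= values[start++];               // shrink from the left while too large
            }
            if (sum == target && start <= end) {
                return new int[]{start, end + 1};
            }
        }
        return null;
    }
}
```

Each index enters and leaves the window at most once, so under the non-negativity assumption this runs in O(n).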

How can I get the most frequent 100 numbers out of 4,000,000,000 numbers?

Yesterday in a coding interview I was asked how to get the most frequent 100 numbers out of 4,000,000,000 integers (may contain duplicates), for example:
813972066
908187460
365175040
120428932
908187460
504108776
The first approach that came to my mind was using HashMap:
static void printMostFrequent100Numbers() throws FileNotFoundException {
// Group unique numbers, key=number, value=frequency
Map<String, Integer> unsorted = new HashMap<>();
try (Scanner scanner = new Scanner(new File("numbers.txt"))) {
while (scanner.hasNextLine()) {
String number = scanner.nextLine();
unsorted.put(number, unsorted.getOrDefault(number, 0) + 1);
}
}
// Sort by frequency in descending order
List<Map.Entry<String, Integer>> sorted = new LinkedList<>(unsorted.entrySet());
sorted.sort((o1, o2) -> o2.getValue().compareTo(o1.getValue()));
// Print first 100 numbers
int count = 0;
for (Map.Entry<String, Integer> entry : sorted) {
System.out.println(entry.getKey());
if (++count == 100) {
return;
}
}
}
But it probably would throw an OutOfMemory exception for the data set of 4,000,000,000 numbers. Moreover, since 4,000,000,000 exceeds the maximum length of a Java array, let's say numbers are in a text file and they are not sorted. I assume multithreading or Map Reduce would be more appropriate for big data set?
How can the top 100 values be calculated when the data does not fit into the available memory?

If the data is sorted, you can collect the top 100 in O(n) where n is the data's size. Because the data is sorted, the distinct values are contiguous. Counting them while traversing the data once gives you the global frequency, which is not available to you when the data is not sorted.
See the sample code below on how this can be done. There is also an implementation (in Kotlin) of the entire approach on GitHub
Note: Sorting is not required. What is required is that distinct values are contiguous and so there is no need for ordering to be defined - we get this from sorting but perhaps there is a way of doing this more efficiently.
You can sort the data file using (external) merge sort in roughly O(n log n) by splitting the input data file into smaller files that fit into your memory, sorting and writing them out into sorted files then merging them.
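The merge step of such an external sort is typically a k-way merge driven by a min-heap. The sketch below uses in-memory long[] arrays as stand-ins for the sorted chunk files; a real version would read from and write to files instead. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

final class RunMerger {
    /** Merges already-sorted runs into one sorted list using a min-heap of cursors. */
    static List<Long> merge(List<long[]> sortedRuns) {
        // Each heap entry is {runIndex, positionInRun}, ordered by the value it points at.
        PriorityQueue<int[]> heap = new PriorityQueue<>(
                (x, y) -> Long.compare(sortedRuns.get(x[0])[x[1]], sortedRuns.get(y[0])[y[1]]));
        for (int r = 0; r < sortedRuns.size(); r++) {
            if (sortedRuns.get(r).length > 0) {
                heap.add(new int[]{r, 0});
            }
        }
        List<Long> merged = new ArrayList<>();
        while (!heap.isEmpty()) {
            int[] cursor = heap.poll();              // smallest value among all run heads
            long[] run = sortedRuns.get(cursor[0]);
            merged.add(run[cursor[1]]);
            if (cursor[1] + 1 < run.length) {
                heap.add(new int[]{cursor[0], cursor[1] + 1}); // advance within the same run
            }
        }
        return merged;
    }
}
```

Only one element per run needs to be held in memory at a time, which is what makes the approach work for data larger than RAM.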
About this code sample:
Sorted data is represented by a long[]. Because the logic reads values one by one, it's an OK approximation of reading the data from a sorted file.
The OP didn't specify how multiple values with equal frequency should be treated; consequently, the code doesn't do anything beyond ensuring that the result is top N values in no particular order and not implying that there aren't other values with the same frequency.
import java.util.*;
import java.util.Map.Entry;
class TopN {
    private final int maxSize;
    private final Map<Long, Long> countMap;
    public TopN(int maxSize) {
        this.maxSize = maxSize;
        this.countMap = new HashMap<>(maxSize);
    }
    // Keeps (value, count) if there is room, or if it beats the current minimum count
    private void addOrReplace(long value, long count) {
        if (countMap.size() < maxSize) {
            countMap.put(value, count);
        } else {
            Optional<Entry<Long, Long>> opt = countMap.entrySet().stream().min(Entry.comparingByValue());
            Entry<Long, Long> minEntry = opt.get();
            if (minEntry.getValue() < count) {
                countMap.remove(minEntry.getKey());
                countMap.put(value, count);
            }
        }
    }
    public Set<Long> get() {
        return countMap.keySet();
    }
    // Expects data in which equal values are contiguous (e.g. sorted data)
    public void process(long[] data) {
        if (data.length == 0) {
            return; // nothing to count
        }
        long value = data[0];
        long count = 0;
        for (long current : data) {
            if (current == value) {
                ++count;
            } else {
                // the run of 'value' has ended, record its total count
                addOrReplace(value, count);
                value = current;
                count = 1;
            }
        }
        addOrReplace(value, count); // record the last run
    }
    public static void main(String[] args) {
        long[] data = {0, 2, 3, 3, 4, 5, 5, 5, 5, 6, 6, 6, 7};
        TopN topMap = new TopN(2);
        topMap.process(data);
        System.out.println(topMap.get()); // [5, 6]
    }
}

Integers are signed 32 bits, so if only non-negative integers occur, we look at 2^31 max different entries. An array of 2^31 bytes is right at the limit: a single Java array can hold at most 2^31 - 1 elements (slightly fewer on some JVMs), so in practice the counters may have to be split across two arrays.
But that can't hold frequencies higher than 255, you would say? Yes, you're right.
So we add a hashmap for all entries that exceed the max value possible in your array (255 - if it's signed just start counting at -128). There are at most 16 million entries in this hash map (4 billion divided by 255), which should be possible.
We have two data structures:
a large array, indexed by the number read (0..2^31) of bytes.
a hashmap of (number read, frequency)
Algorithm:
while reading next number 'x'
{
    if (hashmap.contains(x))
    {
        hashmap[x]++;
    }
    else
    {
        bigarray[x]++;
        if (bigarray[x] > 250)
        {
            hashmap[x] = bigarray[x];
        }
    }
}
// when done:
// Look up top-100 in hashmap
// if not 100 yet, add more from bigarray, skipping those already taken from the hashmap
I'm not fluent in Java, so can't give a better code example.
Note that this algorithm is single-pass, works on unsorted input, and doesn't use external pre-processing steps.
All it does is assume a maximum for the numbers read. It should work if the input consists of non-negative Integers, which have a maximum of 2^31 - 1. The sample input satisfies that constraint.
The algorithm above should satisfy most interviewers that ask this question. Whether you can code in Java should be established by a different question. This question is about designing data structures and efficient algorithms.
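In Java, the counting part of the pseudocode above might look like the sketch below. It is scaled down on purpose: the constructor takes a hypothetical valueRange parameter instead of hard-coding 2^31, and bytes are treated as unsigned via & 0xFF. The final top-100 selection is left out.

```java
import java.util.HashMap;
import java.util.Map;

final class ByteArrayCounter {
    private static final int SPILL_THRESHOLD = 250;   // same cut-off as the pseudocode
    private final byte[] small;                        // one unsigned byte counter per possible value
    private final Map<Integer, Long> large = new HashMap<>(); // counters that outgrew a byte

    ByteArrayCounter(int valueRange) {
        small = new byte[valueRange];
    }

    void add(int x) {
        Long big = large.get(x);
        if (big != null) {
            large.put(x, big + 1);                     // already spilled: count in the map
            return;
        }
        int count = (small[x] & 0xFF) + 1;             // read the byte as unsigned
        small[x] = (byte) count;
        if (count > SPILL_THRESHOLD) {
            large.put(x, (long) count);                // move the counter to the map
        }
    }

    long count(int x) {
        Long big = large.get(x);
        return big != null ? big : (small[x] & 0xFF);
    }
}
```

Once a value has spilled, its byte counter is never touched again, so it cannot overflow.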

In pseudocode:
1. Perform an external sort
2. Do a pass to collect the top 100 frequencies (not which values have them)
3. Do another pass to collect the values that have those frequencies
Assumption: There are clear winners - no ties (outside the top 100).
Time complexity: O(n log n) (approx) due to sort.
Space complexity: Available memory, again due to sort.
Steps 2 and 3 are both O(n) time and O(1) space.
If there are no ties (outside the top 100), steps 2 and 3 can be combined into one pass, which wouldn’t improve the time complexity, but would improve the run time slightly.
If there are ties that would make the quantity of winners large, you couldn’t discover that and take special action (e.g., throw error or discard all ties) without two passes. You could however find the smallest 100 values from the ties with one pass.

But it probably would throw an OutOfMemory exception for the data set of 4000000000 numbers. Moreover, since 4000000000 exceeds max length of Java array, let's say numbers are in a text file and they are not sorted.
That depends on the value distribution. If you have 4E9 numbers, but the numbers are integers 1-1000, then you will end up with a map of 1000 entries. If the numbers are doubles or the value space is unrestricted, then you may have an issue.
As in the previous answer - there's a bug
unsorted.put(number, unsorted.getOrDefault(number, 0) + 1);
I personally would use "AtomicLong" for value, it allows to increase the value without updating the HashMap entries.
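A small sketch of what the AtomicLong variant could look like (the class name is made up). Besides avoiding a put per line, a long counter also cannot overflow at 4 billion occurrences, whereas an Integer value would.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;

final class AtomicLongCounter {
    private final Map<String, AtomicLong> counts = new HashMap<>();

    void add(String number) {
        // The map is only modified the first time a key is seen;
        // afterwards the existing counter object is incremented in place.
        counts.computeIfAbsent(number, k -> new AtomicLong()).incrementAndGet();
    }

    long count(String number) {
        AtomicLong c = counts.get(number);
        return c == null ? 0 : c.get();
    }
}
```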
I assume multithreading or Map Reduce would be more appropriate for big data set?
What would be the most efficient solution for this problem?
This is a typical map-reduce exercise example, so in theory you could use a multi-threaded or M-R approach. Maybe it's the goal of your exercise, and you are supposed to implement the multithreaded map-reduce tasks regardless of whether it's the most efficient way or not.
In reality you should calculate if it is worth the effort. If you're reading the input serially (as it's in your code using the Scanner), then definitely not. If you can split the input files and read multiple parts in parallel, considering the I/O throughput, it may be the case.
Or maybe if the value space is too large to fit into memory and you will need to downscale the dataset, you may consider different approach.

One option is a type of binary search. Consider a binary tree where each split corresponds to a bit in a 32-bit integer. So conceptually we have a binary tree of depth 32. At each node, we can compute the count of numbers in the set that start with the bit sequence for that node. This count is an O(n) operation, so the total cost of finding our most common sequence is going to be O(n * f(n)) where the function depends on how many nodes we need to enumerate.
Let's start by considering a depth-first search. This provides a reasonable upper bound to the stack size during enumeration. A brute force search of all nodes is obviously terrible (in that case, you can ignore the tree concept entirely and just enumerate over all the integers), but we have two things that can prevent us from needing to search all nodes:
If we ever reach a branch where there are 0 numbers in the set starting with that bit sequence, we can prune that branch and stop enumerating.
Once we hit a terminal node, we know how many occurrences of that specific number there are. We add this to our 'top 100' list, removing the lowest if necessary. Once this list fills up, we can start pruning any branches whose total count is lower than the lowest of the 'top 100' counts.
I'm not sure what the average and worst-case performance for this would be. It would tend to perform better for sets with fewer distinct numbers and probably performs worst for sets that approach uniformly distributed, since that implies more nodes will need to be searched.
A few observations:
There are at most N terminal nodes with non-zero counts, but since N (4 * 10^9) is nearly as large as 2^32 (about 4.29 * 10^9) in this specific case, that doesn't matter.
The total number of nodes for M leaf nodes (M = 2^32) is 2M-1. This is still linear in M, so worst case running time is bounded above at O(N*M).
This will perform worse than just searching all integers for some cases, but only by a linear scalar factor. Whether this performs better on average depends on the expected data. For uniformly random data sets, my intuitive guess is that you'd be able to prune enough branches once the top-100 list fills up that you would tend to require fewer than M counts, but that would need to be evaluated empirically or proven.
As a practical matter, the fact that this algorithm just requires read-only access to the data set (it only ever performs a count of numbers starting with a certain bit pattern) means it is amenable to parallelization by storing the data across multiple arrays, counting the subsets in parallel, then adding the counts together. This could be a pretty substantial speedup in a practical implementation that's harder to do with an approach that requires sorting.
A concrete example of how this might execute, for a simpler set of 3-bit numbers and only finding the single most frequent. Let's say the set is '000, 001, 100, 001, 100, 010'.
Count all numbers that start with '0'. This count is 4.
Go deeper, count all numbers that start with '00'. This count is 3.
Count all numbers that are '000'. This count is 1. This is our new most frequent.
Count all numbers that are '001'. This count is 2. This is our new most frequent.
Take next deep branch and count all numbers that start with '01'. This count is 1, which is less than our most frequent, so we can stop enumerating this branch.
Count all numbers that start with '1'. This count is 2, which does not exceed our most frequent count of 2, so we can stop enumerating this branch (at best it would produce a tie).
We're out of branches, so we're done and '001' is a most frequent number (tied with '100').
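The walk-through can be sketched in Java as follows. Like the example, it only finds a single most frequent value rather than the top 100. The class name and the bits parameter are assumptions for illustration, and the data is assumed to consist of non-negative values below 2^bits.

```java
final class PrefixSearch {
    private final int[] data;
    private final int bits;
    private int bestValue = -1;
    private long bestCount = 0;

    private PrefixSearch(int[] data, int bits) {
        this.data = data;
        this.bits = bits;
    }

    /** Counts the numbers whose top `depth` bits equal `prefix` (one O(n) pass). */
    private long countWithPrefix(int prefix, int depth) {
        if (depth == 0) {
            return data.length;                 // the empty prefix matches everything
        }
        long count = 0;
        for (int x : data) {
            if ((x >>> (bits - depth)) == prefix) count++;
        }
        return count;
    }

    private void visit(int prefix, int depth) {
        long count = countWithPrefix(prefix, depth);
        if (count == 0 || count <= bestCount) {
            return;                             // prune: this branch cannot beat the current best
        }
        if (depth == bits) {                    // terminal node: an exact value
            bestValue = prefix;
            bestCount = count;
            return;
        }
        visit(prefix << 1, depth + 1);          // next bit = 0
        visit((prefix << 1) | 1, depth + 1);    // next bit = 1
    }

    /** Returns a most frequent value, or -1 for empty data. */
    static int mostFrequent(int[] data, int bits) {
        PrefixSearch search = new PrefixSearch(data, bits);
        search.visit(0, 0);
        return search.bestValue;
    }
}
```

Calling mostFrequent on the set from the example with bits = 3 visits the same nodes in the same order as the steps listed above.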

Since the data set is presumably too big for memory, I'd do a hexadecimal radix sort. So the data set would get split between 16 files in each pass with as many passes as needed to get to the largest integer.
The second part would be to combine the files into one large data set.
The third part would be to read the file number by number and count the occurrence of each number. Save the number and number of occurrences into a two-dimensional array (the list) which is sorted by size. If the next number from the file has more occurrences than the number in the list with the lowest occurrences then replace that number.

Linux tools
That's simply done in a shell script on Linux/Mac:
sort inputfile | uniq -c | sort -nr | head -n 100
If the data is already sorted, you just use
uniq -c inputfile | sort -nr | head -n 100
File system
Another idea is to use the number as the filename and increase the file size for each hit
while read -r number;
do
    echo -n "." >> "$number"
done < inputfile
File system constraints could cause trouble with that many files, so you can create a directory tree with the first digits and store the files there.
When finished, you traverse through the tree and remember the 100 highest seen values for file size.
Database
You can use the same approach with a database, so you don't need to actually store the GB of data there (works too), just the counters (needs less space).
Interview
An interesting question would be how you handle edge cases, so what should happen if the 100th, 101st, ... number have the same frequency. Are the integers only positive?
What kind of output do they need, just the numbers or also the frequencies? Just think it through like a real task at work and ask everything you need to know to solve it. It's more about how you think and analyze a problem.

I have noticed there is a bug in this line.
unsorted.put(number, unsorted.getOrDefault(number, 1) + 1);
You should make the default value as 0 as you are then adding 1 to it. If not when you only have 1 occurrence of a value, it is recorded as the frequency of 2.
unsorted.put(number, unsorted.getOrDefault(number, 0) + 1);
One downside I see is that you sort all of the (up to 4 billion) frequencies even though you only need the top 100.
You can use a PriorityQueue to hold only 100 values.
Map<String, Integer> unsorted = new HashMap<>();
// Min-heap ordered by frequency: the head is the least frequent of the kept entries
PriorityQueue<Map.Entry<String, Integer>> highestFrequentValues = new PriorityQueue<>(100,
        (o1, o2) -> o1.getValue().compareTo(o2.getValue()));
// O(n)
try (Scanner scanner = new Scanner(new File("numbers.txt"))) {
    while (scanner.hasNextLine()) {
        String number = scanner.nextLine();
        unsorted.put(number, unsorted.getOrDefault(number, 0) + 1);
    }
}
// O(m log 100) for m distinct numbers
for (Map.Entry<String, Integer> stringIntegerEntry : unsorted.entrySet()) {
    if (highestFrequentValues.size() < 100) {
        highestFrequentValues.add(stringIntegerEntry);
    } else if (highestFrequentValues.peek().getValue() < stringIntegerEntry.getValue()) {
        // Replace the least frequent kept entry with the more frequent one
        highestFrequentValues.poll();
        highestFrequentValues.add(stringIntegerEntry);
    }
}
// At most 100 entries, printed in heap order (not sorted by frequency)
for (Map.Entry<String, Integer> frequentValue : highestFrequentValues) {
    System.out.println(frequentValue.getKey());
}

OK, I know that the question is about Java and algorithms and solving this problem otherwise is not the point, but I still think this solution must be posted for completeness.
Solution in sh:
sort FILE | uniq -c | sort -nr | head -n 100
Explanation: sort | uniq -c lists only unique entries and counts the number of their occurrences in the input; sort -nr sorts the output numerically in reverse order (the lines with more occurrences on the top); head -n 100 keeps 100 top lines only. A file with 4,000,000,000 numbers up to 999999999 (as per OP) will take about ~40GB, so fits well on a disk of a single machine, so it is technically possible to use this solution.
Pro: simple, has constant and limited memory usage. Cons: sub-optimal (because of sort), consumes lots of the temporary disk space for the operation, and overall there is no doubt that a solution specifically designed for this problem will have a much better performance. The question remains (in all seriousness): in a general case, will writing (and then debugging and executing) an optimized solution take more or less time than using a sub-optimal one (as above) but available immediately? I ran the solution on a sample file with 400,000,000 lines (10x smaller) and it took about 7 minutes on my computer.
P.S. On a side note, OP mentions that this question was asked during a programming interview. This is interesting because I think this a kind of a solution worth mentioning in this context before starting to code another program from scratch. When people say "experienced engineers are 10x faster...", I personally don't think that this is because experienced engineers code faster or produce optimized algorithms off the top of the head, but because they explore the alternatives that can save time. In the context of an interview it is an important skill to demonstrate among others.

I suppose that 4 billion was chosen to be sure the problem is too large to fit in memory on current desktop machines. So rent a large VM from Amazon or Microsoft for the purpose? That's an answer most people don't think of yet, but it is valid for real-world solutions.
The way I'd approach it is to start by binning. The range of numbers is presumably all 32-bit unsigned integers (or whatever they said). How large an array fits in RAM? Divide the range into that many equal bins and pass through the data once. Look over the distribution: is it fairly uniform, or spiky, or a curve of some kind? If the first/last ranges of bins are zeros, that gives you the true range of input values, and you can adjust the program to bin over just that range and repeat, to get better accuracy.
Then depending on the distribution, decide how to proceed. In general, only the top 100 bins can possibly contain the top 100 values, so you can reconfigure with those ranges and the largest bins you can handle within that excerpted range. If the distribution is too uniform, you might get many many bins with all the same count, so drop the smaller bins even though you have many more than 100 bins remaining -- you still cut it down some.
Worst case is that all the bins come out the same and you can't cut it down this way! Someone prepared some pathological data assuming this kind of approach. So re-arrange the way you do the binning. Rather than simply chopping into contiguous ranges of equal size, use a 1:1 mapping to shuffle them. However, for large bins, this might preserve the property of being fairly uniform, so you don't want a conventional "good" hashing function.
Another approach
If binning works, and rapidly cuts down the problem, it's easy. But the data could be such that it's actually very difficult. So what's a way that always works, regardless of the data? Well, I can assume that the result exists: some 100 values will have more occurrences.
Instead of bins, pick N specific values (however many you can fit in memory). Either choose random numbers, or use the first N distinct values from your input. Count those, and copy the others to another file. That is, the values you don't have room to count get copied to a (smaller than the original) file.
Now you'll at least have a useful pivot value: the exact cardinality of the 100 distinct top values that you did count exactly. Well, the ones you picked might still end up being all the same count! So you only have 1 distinct cardinality worst case. You know that this is not a "top" value since there are far more than 100 of them.
Run again on your new (smaller) file, and discard counts that are smaller than the top 100 you already know. Repeat.
This reminds me of something that I might have read in Knuth's TAOCP, but scaled up for modern machine sizes.
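The count-and-spill passes could be sketched like this. An in-memory List stands in for the input file and for the smaller spill files, capacity plays the role of "however many you can fit in memory", and the names are hypothetical. As a simplification, the sketch trims the running result to the k best counts after every pass instead of filtering with a pivot value.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class SpillCounter {
    /** Returns the k most frequent values with their exact counts, using at most `capacity` counters per pass. */
    static Map<Integer, Long> topK(List<Integer> input, int capacity, int k) {
        if (capacity < 1) {
            throw new IllegalArgumentException("capacity must be at least 1");
        }
        Map<Integer, Long> top = new HashMap<>();
        List<Integer> remaining = input;
        while (!remaining.isEmpty()) {
            Map<Integer, Long> counts = new HashMap<>();
            List<Integer> spill = new ArrayList<>();   // stands in for the smaller output file
            for (int x : remaining) {
                if (counts.containsKey(x) || counts.size() < capacity) {
                    counts.merge(x, 1L, Long::sum);    // one of the values counted in this pass
                } else {
                    spill.add(x);                      // no room: defer to a later pass
                }
            }
            top.putAll(counts);                        // counts from one pass are already exact
            while (top.size() > k) {                   // keep only the k best seen so far
                top.remove(Collections.min(top.entrySet(), Map.Entry.comparingByValue()).getKey());
            }
            remaining = spill;
        }
        return top;
    }
}
```

A value is either counted completely in the pass where it first gets a counter or spilled completely, so no count is ever split across passes.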

I would just drop all the numbers in a database (SQLite would be my first choice) with a table like
CREATE TABLE tbl (
number INTEGER PRIMARY KEY,
counter INTEGER
)
Then for every number received, just do a
INSERT INTO tbl (number,counter) VALUES (:number,1) ON DUPLICATE KEY UPDATE counter=counter+1;
or with SQLite syntax
INSERT INTO tbl (number,counter) VALUES (:number,1) ON CONFLICT(number) DO UPDATE SET counter=counter+1;
Then when all the numbers are accounted for,
SELECT number, counter FROM tbl ORDER BY counter DESC LIMIT 100
... then I would end up with the 100 most common numbers, and how often they occurred. This scheme will only break when you run out of disk space... (or when you reach ~20000000000000 (20 trillion) unique numbers at some ~281 terabytes of disk space...)
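For illustration, the SQLite variant can be driven from Python's standard `sqlite3` module roughly as below. This is a sketch, not a tuned loader: the function name `top_counts` and the in-memory database are assumptions for the example, and the `ON CONFLICT ... DO UPDATE` clause needs SQLite 3.24 or newer.

```python
import sqlite3

def top_counts(numbers, limit=100):
    """Count occurrences with an upsert per number, then read back the most common."""
    con = sqlite3.connect(":memory:")  # a file path would be used for real data
    con.execute("CREATE TABLE tbl (number INTEGER PRIMARY KEY, counter INTEGER)")
    con.executemany(
        "INSERT INTO tbl (number, counter) VALUES (?, 1) "
        "ON CONFLICT(number) DO UPDATE SET counter = counter + 1",
        ((n,) for n in numbers),
    )
    rows = con.execute(
        # The secondary sort key only makes ties deterministic.
        "SELECT number, counter FROM tbl ORDER BY counter DESC, number ASC LIMIT ?",
        (limit,),
    ).fetchall()
    con.close()
    return rows
```

Calling `top_counts([7, 7, 3, 7, 3, 9], 2)` returns the two most frequent numbers with their counts.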

Divide your numbers into two buckets
Find top 100 in each bucket
Merge those top 100 lists.
To divide, do median of medians (which can be modified to make medians of the top/bottom as well).
Each bucket has a distinct range of numbers in it. The initial median split makes 2 buckets, each with half (about) as many elements as the entire list in it.
To find the top 100, first determine whether the bucket is narrow (its minimum and maximum are close; an O(1) check) or small (it has few numbers in it; O(n) time, O(n*bucket count) memory). If either is true, a simple counting pass (possibly covering more than one bucket at once) solves it; you will probably have to do more than one pass, as you have memory limits.
If neither is true, recurse and divide that bucket into two.
There are going to be fiddly bits with how you recurse without wasting too much time.
But the idea is that each bucket exponentially gets narrower or smaller. Narrow buckets have a minimum and maximum that is close, and small buckets have few elements.
You merge buckets so that you have enough storage to count the elements in the bucket (either width based, or volume based). Then you do a pass that counts that bucket and finds the top 100, and repeat. Each time you merge the top 100 from the scan into the previous top 100.
In-place, no sorting of the entire list needed, and devolves to simpler and more optimal strategies when the initial "bucket" is narrow or small.

I assume that the point of the challenge is to process this large amount of data without consuming too much memory, and avoid parsing the input too many times.
Here's an algorithm that would require two not too large arrays. Don't know about java, but I am confident that this can be made to run very fast in C:
Create a Count array of size 2^n to count the number of input numbers based on their n most significant bits. That will require a first scan over the input data but is really straightforward to do. I would first try with n=20 (about one million buckets).
Obviously, we won't process the data one bucket at a time, as that would require reading the input a million times. Instead we choose our optimal batch size B and allocate a Batch array of size B. B could be something like 40M, so that we aim at reading the input about 100 times. (It all depends on available memory.)
Then we iterate over the count array to group the first range of buckets so that the sum is close to, but doesn't exceed B.
For each such range, we parse the input data, look for numbers in range and copy those numbers to the batch array. Since we already know the size of each bucket, we can immediately copy them grouped per bucket, so that we only have to sort them bucket by bucket (you can repurpose the count array to store the indices for where to write the next entry). Next we count the identical items in the sorted batch array and keep track of the top 100 so far.
Proceed to the next range of buckets for which the sum of counts is under size B, etc...
Optimizations:
Once we start having a decent top 100, we can skip entire buckets whose size is below our 100th entry. For this we can use a special value (such as -1) in the count array, to indicate there is no index. Depending on the data, this can drastically reduce the number of passes required.
When counting identical items in the sorted Batch, we can make jumps of the size of our 100th entry (and then take a few steps backwards). I can share pseudo-code if needed.
Potential issues with this approach:
The input numbers could be concentrated in a small range, then you might get one or more single buckets that are larger than B. Possible solutions:
You could try another selection of n bits instead (e.g. the n least significant bits). Note that this still won't help if the same number appears a billion times.
If the input is 32bit integers, then the range of possible values is limited, and there can only be a few thousand different numbers in each bucket. So if one bucket is really large, then we can process that bucket differently: Just keep a counter for each unique value in that range. We can repurpose the Batch array for that.
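A rough Python sketch of the two-phase scheme described above (count by most significant bits, then process ranges of buckets that fit in one batch). The names `top_k_by_prefix_buckets`, `read_input`, `bits` and `prefix_bits` are invented for the example, the inputs are assumed to be non-negative integers below `2**bits`, and the sketch sorts each batch as a whole rather than bucket by bucket. An oversized single bucket is simply processed on its own here, without the special handling discussed above.

```python
import heapq
from itertools import groupby

def top_k_by_prefix_buckets(read_input, k, bits, prefix_bits, batch_size):
    """read_input() must return a fresh iterator over the data on every call."""
    shift = bits - prefix_bits
    bucket_counts = [0] * (1 << prefix_bits)
    for x in read_input():                    # first scan: bucket sizes only
        bucket_counts[x >> shift] += 1

    best = []                                 # (count, value) pairs, best k so far
    lo, nbuckets = 0, len(bucket_counts)
    while lo < nbuckets:
        hi, total = lo, 0
        # Grow the range while it fits; always take at least one bucket.
        while hi < nbuckets and (hi == lo or total + bucket_counts[hi] <= batch_size):
            total += bucket_counts[hi]
            hi += 1
        if total:
            # One scan per range: copy the matching numbers and sort them.
            batch = sorted(x for x in read_input() if lo <= (x >> shift) < hi)
            runs = [(len(list(group)), value) for value, group in groupby(batch)]
            best = heapq.nlargest(k, best + runs)
        lo = hi
    return [(v, c) for c, v in best]
```

With a list `data` held in memory, `read_input` can be `lambda: iter(data)`; for a file it would reopen and parse the file on each call.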

Find if a permutation (using + and -) on a string of integers matches a number

Basically what I am doing is taking a string of integers (e.g. "1234"), and I am able to insert a + or - anywhere in this string, as much or as little as I want. For example, I can do "1 + 2 + 3 + 4", "12 + 34", "123 - 4", etc. It is required to use all integers of the string; I cannot exclude any.
What I am trying to do is take another array of integers, and find if it was possible to get that number using the permutations mentioned in the first paragraph. I am somewhat lost on where to start looking for this. I could possibly create a recursive loop function to create every possible combination of the string and see if each result matches but this seems like it will be terribly slow. Another thought was to index them into an array - that way I could simply look up the answers after calculating them once.
Anyone have any suggestions?

I could possibly create a recursive loop function to create every possible combination of the string and see if each result matches but this seems like it will be terribly slow.
Doing an exhaustive search is your only option here. Fortunately, the timing isn't going to be too bad even for moderately long strings of up to 7..10 characters, because you do not need to "redo" additions and subtractions of a prior string when you process the "tail".
An outline of a possible implementation could be as follows:
Put all desired results from your array of integers in a hash set
Make a recursive method that takes the result so far, the string, and the position of the next "cut"
When the next "cut" is at the end of the string, check the result so far against the hash set from step 1
Otherwise, try these two possibilities in a loop on k
Use a k-digit number from the "cut" as a positive number, and make a recursive invocation with the "cut" moved by k digits. This is equivalent to inserting a + at the cut
Use a k-digit number from the "cut" as a negative number, and make a recursive invocation with the "cut" moved by k digits. This is equivalent to inserting a - at the cut
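The outline above could look roughly like this in Python. It is a sketch under assumptions: the function name `reachable_targets` is made up, and the first number is only ever taken as positive (no leading minus sign), which the question does not settle either way.

```python
def reachable_targets(digits, targets):
    """Return the subset of `targets` obtainable by inserting + and - into `digits`."""
    wanted = set(targets)        # step 1: hash set of the desired results
    found = set()

    def walk(total, cut):
        if cut == len(digits):   # the cut reached the end: check the result so far
            if total in wanted:
                found.add(total)
            return
        for k in range(1, len(digits) - cut + 1):
            term = int(digits[cut:cut + k])   # k-digit number starting at the cut
            walk(total + term, cut + k)       # equivalent to '+' at the cut
            if cut > 0:                       # assumption: no '-' before the first number
                walk(total - term, cut + k)   # equivalent to '-' at the cut

    walk(0, 0)
    return found
```

For example, `reachable_targets("1234", [10, 46, 119, 5000])` finds 10 (1+2+3+4), 46 (12+34) and 119 (123-4), but not 5000.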

I'll give you some help to get started, with the approach for such a solution:
formal problem statement;
data model;
algorithm;
heuristics, cleverness.
For N digits there are some 3^N possibilities.
The solution must model the running data as:
the digits, as int[]
the sum
index from which to advance, prior digits were done.
the number part already tried, plus its sign. The sign must be kept separately (as -1, +1) as the coming digit may be 0;
(What I leave out is the collecting of the entire result.)
The brute force solution then could be:
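boolean solve(int[] digits, int sum) {
    if (digits.length == 0) {
        return sum == 0;
    }
    // Start with the first digit as the current part, so no sign precedes it.
    return solve(digits, sum, 1, digits[0], 1);
}

boolean solve(int[] digits, int sum, int signum, int part, int index) {
    if (index >= digits.length) {
        return signum * part == sum;
    }
    // Before the digit at index do either nothing, +, or -.
    // After a sign, the digit at index starts the new part (it must not be skipped).
    return solve(digits, sum, signum, part * 10 + digits[index], index + 1)
        || solve(digits, sum - signum * part, 1, digits[index], index + 1)
        || solve(digits, sum - signum * part, -1, digits[index], index + 1);
}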
Note that you could also split the digits in half and try to insert (nothing, +, -) there.
There are pruning opportunities to reduce the number of tries. First, the above can be done in a loop, and not all alternatives need to be tried. The order of evaluation might favor more likely candidates:
if digit 0 ...
if part > sum first - then +
...
Unfortunately, as far as I know, the mix of + and - makes a number-theoretic approach illusory.
@dasblinkenlight mentions even better data models that avoid repeating the evaluation in the alternatives. That would be even more interesting, but it might fail miserably due to time constraints, and I wanted to come up with something concrete without providing an entirely ready-made solution.

It is reasonable to take a brute force approach if you can rely on the input string not to be too long. If it contains n digits then you can construct 3^(n-1) formulae from it (between each pair of digits you can insert '+', '-', or nothing, for n-1 internal positions). For a 12-digit input string that's 3^11 = 177147 formulae, which should be computable quite quickly. Of course, you would build and compute each one once, and compare the result to all the alternatives. Don't redo the computation for each array element.
It may be that there's a dynamic programming approach to this, but I'm not immediately seeing it, at least not one that would be substantially better than brute force.
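As a sketch of the brute-force enumeration described above, the following Python evaluates every formula exactly once and collects the results in a set, which can then be checked against every array element. The function name `all_results` is an assumption, and the input is assumed to be a non-empty string of digits.

```python
from itertools import product

def all_results(digits):
    """Evaluate every way of putting '', '+' or '-' between adjacent digits.

    Returns (set of distinct results, number of formulae evaluated).
    """
    results = set()
    formulas = 0
    for ops in product(("", "+", "-"), repeat=len(digits) - 1):
        total, sign, part = 0, 1, int(digits[0])
        for op, ch in zip(ops, digits[1:]):
            if op == "":
                part = part * 10 + int(ch)   # extend the current number
            else:
                total += sign * part         # close the current number
                sign = 1 if op == "+" else -1
                part = int(ch)
        results.add(total + sign * part)
        formulas += 1
    return results, formulas
```

A lookup such as `[t in results for t in targets]` then answers all queries without recomputing anything.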

Arrangements of sets of k positions in a n-competitors race

This is a copy of my post on mathexchange.com.
Let E(n) be the set of all possible ending arrangements of a race of n competitors.
Obviously, because it's a race, each one of the n competitors wants to win.
Hence, the order of the arrangements does matter.
Let us also say that if two competitors end with the same result of time, they win the same spot.
For example, E(3) contains the following sets of arrangements:
{(1,1,1), (1,1,2), (1,2,1), (1,2,2), (1,2,3), (1,3,2), (2,1,1), (2,1,2),(2,1,3), (2,2,1), (2,3,1), (3,1,2), (3,2,1)}.
Needless to say, for example, that the arrangement (1,3,3) is invalid, because the two competitors that supposedly ended in the third place, actually ended in the second place. So the above arrangement "transfers" to (1,2,2).
Define k to be the number of distinct positions of the competitors in an arrangement (an element of E(n)).
We have for example:
(1,1,1) -------> k = 1
(1,2,1) -------> k = 2
(1,2,3,2) -------> k = 3
(1,2,1,5,4,4,3) -------> k = 5
Finally, let M(n,k) be the number of arrangements in E(n) in which the competitors ended in exactly k distinct positions.
We get, for example, M(3,3) = M(3,2) = 6 and M(3,1) = 1.
-------------------------------------------------------------------------------------------
Thus far is the question
It's a problem I came up with solely by myself. After some time of thought I came up with the following recursive formula for |E(n)|:
(Don't continue reading if you want to derive a formula yourself!)
|E(n)| = sum from l=1 to n of C(n,l)*|E(n-l)| where |E(0)| = 1
And the code in Java for this function, using the BigInteger class:
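public static BigInteger E(int n)
{
    // Ens[n] caches |E(n)|; BigInteger.ZERO means "not computed yet".
    // Ens[0] is assumed to be preset to 1, since |E(0)| = 1.
    if (!Ens[n].equals(BigInteger.ZERO))
        return Ens[n];
    else
    {
        BigInteger ends = BigInteger.ZERO;
        // Sum C(n,l) * |E(n-l)| over l = 1..n, with C(n,l) = n! / (l! * (n-l)!).
        for (int l = 1; l <= n; l++)
            ends = ends.add(factorials[n].divide(factorials[l].multiply(factorials[n-l])).multiply(E(n-l)));
        Ens[n] = ends;
        return ends;
    }
}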
The factorials array is an array of precalculated factorials for faster binomial coefficients calculations.
The Ens array is an array of the memoized/cached E(n) values which really quickens the calculating, due to the need of repeatedly calculating certain E(n) values.
The logic behind this recurrence relation is that l symbolizes how many "first" spots we have. For each l, the binomial coefficient C(n,l) symbolizes in how many ways we can pick l first-placers out of the n competitors. Once we have chosen them, we need to figure out in how many ways we can arrange the n-l competitors we have left, which is just |E(n-l)|.
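For cross-checking, the same recurrence can be written bottom-up in a few lines of Python. This is only a sketch of the recurrence as stated, not a faster method; the function name `ending_counts` is made up, and `math.comb` requires Python 3.8 or newer.

```python
from math import comb

def ending_counts(n_max):
    """Return [|E(0)|, ..., |E(n_max)|] using |E(n)| = sum_{l=1..n} C(n,l) * |E(n-l)|."""
    E = [1] + [0] * n_max          # |E(0)| = 1
    for n in range(1, n_max + 1):
        E[n] = sum(comb(n, l) * E[n - l] for l in range(1, n + 1))
    return E
```

Filling the table iteratively avoids deep recursion for large n; a residue such as `ending_counts(100)[100] % 1_000_000_007` can be taken from the exact integers.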
I get the following:
|E(3)| = 13
|E(5)| = 541
|E(10)| = 102247563
|E(100)| mod 1 000 000 007 = 619182829 -------> 20 ms.
And |E(1000)| mod 1 000 000 007 = 581423957 -------> 39 sec.
I figured out that |E(n)| can also be visualized as the number of sets to which the following applies:
For every i = 1, 2, 3 ... n, every i-tuple subset of the original set has GCD (greatest common divisor) of all of its elements equal to 1.
But I'm not 100% sure about this because I was not able to compute this approach for large n.
However, even with precalculating factorials and memoizing the E(n)'s, the calculating times for higher n's grow very fast.
Is anyone capable of verifying the above formula and values?
Can anyone derive a better, faster formula? Perhaps with generating functions?
As for M(n,k).. I'm totally clueless. I absolutely have no idea how to calculate it, and therefore I couldn't post any meaningful data points.
Perhaps it's P(n,k) = n!/(n-k)!.
Can anyone figure out a formula for M(n,k)?
I have no idea which function is harder to compute, either E(n) or M(n,k), but help with either of them would be very much appreciated.
I want the solutions to be generic as well as work efficiently even for large n's. Exhaustive search is not what I'm looking for, unfortunately.
What I am looking for is solutions based purely on combinatorial approach and efficient formulas.
I hope I was clear enough with the wording and what I ask for throughout my post. By the way, I can program using Java. I also know Mathematica pretty decently :) .
Thanks a lot in advance,
Matan.

E(n) are the Fubini numbers. M(n, k) = S(n, k) * k!, where S(n, k) is a Stirling number of the second kind, because S(n, k) is the number of different placing partitions, and k! is the number of ways to rank them.
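A small Python sketch of this formula, using the standard recurrence S(n, k) = k*S(n-1, k) + S(n-1, k-1) for Stirling numbers of the second kind. The function names `stirling2_row` and `M` are chosen for the example only.

```python
from math import factorial

def stirling2_row(n):
    """Return [S(n,0), ..., S(n,n)] via S(n,k) = k*S(n-1,k) + S(n-1,k-1)."""
    row = [1]                                  # S(0,0) = 1
    for m in range(1, n + 1):
        new = [0] * (m + 1)
        for k in range(1, m + 1):
            carry = k * row[k] if k < m else 0  # S(m-1, m) = 0
            new[k] = carry + row[k - 1]
        row = new
    return row

def M(n, k):
    """Number of race outcomes of n competitors with exactly k distinct positions."""
    return stirling2_row(n)[k] * factorial(k)
```

Summing `M(n, k)` over k = 1..n gives |E(n)|, which is a convenient consistency check against the values in the question.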

Generating a partially ordered random list of numbers

I want to generate a list of 500 random numbers where the list is exactly 30% sorted. I know how to generate a list that is at least 30% sorted, but that's not what I want. How do I generate a file that is "exactly" 30% sorted? I'm stuck. How can this be done?
Here is the exact wording:
"For the sorts, you should construct three different files of each size: ordered, keys in reverse order, and finally one in which 30% of the keys are ordered. The latter file should not consist of files in which your sort is 30% complete, but rather in files in which 30% of the keys are correctly placed with respect to one another but are not necessarily contiguous."

There are 2 main ideas I can see for percentage sorted:
Simply the number of elements out of place.
One should be able to get an estimated % sorted by sorting it, then iterating through it, and, keeping each element the same with the desired percentage as probability, otherwise swapping it with a random remaining element (so, if we want 30% sorted, we'll keep an element the same with 30% probability, and swap it with 70%).
If an exact number is needed, one could use the above result and (intelligently) swap random elements until the desired percentage is obtained.
The number of inversions.
An inversion is a pair of places of a sequence where the elements on these places are out of their natural order.
One idea is to first sort it, then to swap random elements that get us closer to the desired percentage sorted, until we get there.
Only swapping elements that get us closer to the desired result is difficult (at least doing so efficiently).
A very brute force approach would be to count the change in the number of inversions that each pair of swaps would cause, and then pick a random one that gets us closer to our target.
Another idea is to just generate random pairs and count the number of inversions until we find one that gets us closer.
A third option is to pick a random element. If it's larger than half the elements, try to move it left (ideally increasing the number of inversions). If it's smaller, try to move it right. In trying to move it left/right, we can look for a smaller / larger element (respectively) to swap it with and count the change in inversions (we only need to consider the elements between the swapped elements when counting the change in inversions).
At first we could probably just randomly swap elements as we're likely to tend to more inversions.
If the percentage is above 50%, we could also start with a reversed array, i.e. 100% unsorted.
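For the first interpretation (the number of elements out of place), an exact count can be obtained directly rather than by repeated swapping. The Python below is one possible sketch under assumptions: the function name `exactly_in_place` is invented, the keys are assumed to be distinct and already sorted, and the displaced elements are moved along a single random cycle, which guarantees none of them stays put but is not a uniform sample of all such arrangements.

```python
import random

def exactly_in_place(sorted_keys, fraction, rng=random):
    """Return a rearrangement in which exactly round(n * fraction) keys keep their sorted position."""
    n = len(sorted_keys)
    keep = round(n * fraction)
    moved = rng.sample(range(n), n - keep)    # indices that must end up out of place
    if len(moved) == 1:
        raise ValueError("a single element cannot be displaced on its own")
    out = list(sorted_keys)
    # Rotate the chosen positions by one step along a random cycle:
    # each chosen position receives the key of a different chosen position.
    for a, b in zip(moved, moved[1:] + moved[:1]):
        out[a] = sorted_keys[b]
    return out
```

For 500 keys and `fraction=0.3`, exactly 150 keys remain at their sorted index and the other 350 are all displaced.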

There's a one-to-one correspondence that maps permutations to {0} x {0, 1} x {0, 1, 2} x ... x {0, 1, ... n - 1}, where the jth element of the tuple in the codomain is the number of inversions involving elements at positions j and i < j. In this light, the problem is sampling a random element of the codomain that sums to the desired number of inversions.
Here's an instance of Gibbs sampling for this problem. Initialize a tuple summing to the desired number of inversions. Repeatedly select two distinct indices and randomize uniformly among all possibilities with the same sum. Stop when you're tired of waiting (the distribution converges on uniform but never gets there; maybe tomorrow I will figure out a Propp--Wilson style technique for exact samples).
In Python (untested):
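import random

def gibbs(n, target, steps=100000):
    # Inversion table: perm[j] in 0..j counts inversions between j and earlier positions.
    perm = [0] * n
    for i in range(n):
        perm[i] = min(target, i)
        target -= perm[i]
    assert target == 0  # requires target <= n * (n - 1) / 2
    # The stopping rule is left open; a fixed number of steps is an arbitrary choice.
    for _ in range(steps):
        i = random.randrange(n)
        j = random.randrange(n)
        if i == j:
            continue
        total = perm[i] + perm[j]
        # Keep 0 <= perm[i] <= i and 0 <= perm[j] <= j with the sum unchanged.
        perm[i] = random.randrange(max(total - j, 0), min(total, i) + 1)
        perm[j] = total - perm[i]
    # Decode the inversion table into a permutation of 0..n-1.
    for j in range(n):
        perm[j] = j - perm[j]
        for i in range(j):
            if perm[i] >= perm[j]:
                perm[i] += 1
    return perm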
One could also get exact samples by dynamic programming and conditional probability, but the running time for 500 looks slightly prohibitive from here.
