Given a string S, I want to find out whether there are non-overlapping substrings A, B and C in S, so that the equation A + B = C holds when the substrings are interpreted as decimal numbers.
Example: For S = 17512, the answer is yes, because 12 + 5 = 17 holds.
This is not a homework question, I have tried approaching this problem building a suffix array
17512
7512
512
12
2
but then I realized that for 132, where 1 + 2 = 3, the sum sits between the two summands.
Would this require considering other orderings of the selected parts?
How do you solve this in an efficient way?
Let S be the decimal representation of the number. If n = |S| is small enough (<500 or so), you can use the following algorithm:
Let us enumerate A and C from the equation A + B = C (where we assume w.l.o.g. A > B). We know that they need to be of around the same size (plus/minus one digit), so enumerating the possibilities is a cubic operation (there are O(n^3) candidates).
For every candidate pair (A, C), we need to check whether B = C - A is in the string and not overlapping with any of the A or C substrings. We can compute the difference in linear time using arithmetics in base 10.
The tricky part is to check whether B is a substring not overlapping A or C. A and C split the string into 3 parts:
S = xAyCz
If we enumerate them in a clever way, with fixed start positions and decreasing size, we can maintain suffix automata of part x and the reverses of parts y and z.
Now we can check in linear time whether B = C - A (or its reverse) exists in one of the three parts.
Time complexity of this approach: Θ(n^4).
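To make the enumeration concrete, here is a minimal brute-force sketch in Java. It is not the suffix-automaton version described above: it swaps the automata for a plain `indexOf` search, so it is slower, and it assumes S contains only decimal digits and that B appears without leading zeros. The class and method names (`SubstringSum`, `hasSum`, `overlap`) are made up for illustration.

```java
import java.math.BigInteger;

final class SubstringSum {
    // True if s contains non-overlapping substrings A, B, C with A + B = C.
    // Assumption: s holds only decimal digits; B is matched without leading zeros.
    static boolean hasSum(String s) {
        int n = s.length();
        for (int a0 = 0; a0 < n; a0++) {
            for (int a1 = a0 + 1; a1 <= n; a1++) {
                BigInteger a = new BigInteger(s.substring(a0, a1));
                for (int c0 = 0; c0 < n; c0++) {
                    for (int c1 = c0 + 1; c1 <= n; c1++) {
                        if (overlap(a0, a1, c0, c1)) continue;
                        // B is determined by the candidate pair (A, C).
                        BigInteger b = new BigInteger(s.substring(c0, c1)).subtract(a);
                        if (b.signum() < 0) continue;
                        String bs = b.toString();
                        // Look for an occurrence of B that avoids both A and C.
                        for (int p = s.indexOf(bs); p >= 0; p = s.indexOf(bs, p + 1)) {
                            int q = p + bs.length();
                            if (!overlap(p, q, a0, a1) && !overlap(p, q, c0, c1)) {
                                return true;
                            }
                        }
                    }
                }
            }
        }
        return false;
    }

    // Half-open ranges [s1, e1) and [s2, e2) share at least one position.
    static boolean overlap(int s1, int e1, int s2, int e2) {
        return s1 < e2 && s2 < e1;
    }
}
```

For S = 17512 this finds a match such as A = 5, C = 17, B = 12. It is only meant as a reference to test a faster implementation against.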
There is a variation of this which is slightly more complicated, but faster (thanks to Evgeny for pointing it out):
Create a suffix tree of the input string. Every node represents a substring. Store in every node a balanced binary search tree of the positions where the substring occurs in the string. You might need persistent trees here to save time and space.
Enumerate A and C, but this time starting from the least-significant digit (the rightmost end).
While growing A and C from right to left, keep track of the result of B = C - A. It will also grow from the least-significant to the most-significant digit. Do a search for B in the suffix tree. You can do this one digit at a time, so you can grow A and C by 1 digit, update B and locate it in the suffix tree in O(1).
If B is positive, do three range queries in the BBST of positions to check whether B occurs in the string and does not overlap A or C
Runtime: O(n^3 log n).
UPDATE: regarding the simplified version where all characters need to be used:
We first realize that we can do arithmetics on substrings of our string in linear time, if we work in base 10.
Now we want to find the splitting points a < b, so that your three substrings are A = s_1...s_a, B = s_(a+1)...s_b and C = s_(b+1)...s_n.
We can prove that there is only a constant number of candidates for a and b, because all three parts must have approximately the same size for the equation to hold.
Using arbitrary precision arithmetics, we can easily try out all candidate pairs (a,b) and for each of those, find M = max(A,B,C). Then just check whether M is the sum of the other two numbers.
Total time complexity: Θ(n).
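The check on M = max(A, B, C) can be sketched as follows. Note that this sketch tries every pair of split points instead of only the constant number of candidates argued for above, so it is not linear; it just illustrates the test itself. It assumes the string holds only decimal digits, and the names `FullSplitSum` and `canSplit` are hypothetical.

```java
import java.math.BigInteger;

final class FullSplitSum {
    // True if s can be cut into three non-empty parts A|B|C such that the
    // largest part equals the sum of the other two.
    static boolean canSplit(String s) {
        int n = s.length();
        for (int a = 1; a <= n - 2; a++) {
            for (int b = a + 1; b <= n - 1; b++) {
                BigInteger x = new BigInteger(s.substring(0, a));
                BigInteger y = new BigInteger(s.substring(a, b));
                BigInteger z = new BigInteger(s.substring(b));
                BigInteger m = x.max(y).max(z);
                // m is the sum of the other two exactly when x + y + z == 2m.
                if (x.add(y).add(z).equals(m.shiftLeft(1))) {
                    return true;
                }
            }
        }
        return false;
    }
}
```

For 17512 the split 17|5|12 passes the test because 17 + 5 + 12 = 2 * 17.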
If you are allowed to form substrings from arbitrary subsets of digits in their original given order as long as your digits don't overlap in the 2 summands and the sum, then I believe your problem is NP-complete. I think this is even true if the target sum is given and all you have to do is find two non-overlapping substrings of digits that add up to the target sum. However I don't have a proof of NP-completeness yet.
If your substrings of digits have to be consecutive then the situation is much better. You can search over all combinations of 2 summands and 1 sum for the starting and ending points of the numbers in O(n^6) time, and certainly there are improvements that can be made because e.g. for a given target sum, you only need to search over pairs of substrings whose max length adds up to the length of your target sum either exactly or minus 1.
UPDATE: If you need to find 3 non-overlapping contiguous substrings that give you the summation formula, then you can hash all O(n^2) substring values and then hash the sum of all pairs of summands to see if the target sum is in your hash table. If so, then you only need to check that the beginning and ending indices of the two summands do not overlap each other or the indices of the sum. Worst-case time is O(n^6), expected running time is O(n^5) for random inputs.
Assuming (As in both your examples) that your 3 substrings are contiguous, non-overlapping, non-negative, and between them cover the whole input, then there is a quadratic time solution.
First (temporarily) assume the order is aaabbbccc where aaa+bbb=ccc, and aaa>bbb.
The length of ccc must either be the same as aaa or at most one larger.
So the length of aaa (len_a) must be between n/3 and n/2.
Given the len_a, there are two choices for len_c --- len_a or len_a+1.
Given these, there is only one possible length of bbb: len_b = n - len_a - len_c.
Test these 2(n/2 - n/3) = n/3 cases.
Each test is O(n) cost due to string to int conversion.
Repeat the above analysis for two permutations (aaa>bbb v bbb>=aaa), times three permutations (aaa+bbb=ccc v aaa+ccc=bbb v bbb+ccc=aaa)
You could improve the test to check only the most (or least) significant i digits of the three numbers, returning early if the sum was not possible. Assuming randomly distributed digits, you might be able to show that the expected run time of such a test was constant.
This would turn the whole algorithm into an O(n) runtime.
I am trying to think how to solve the Subset sum problem with an extra constraint: The subset of the array needs to be continuous (the indexes needs to be). I am trying to solve it using recursion in Java.
I know the solution for the non-constrained problem: Each element can be in the subset (and thus I perform a recursive call with sum = sum - arr[index]) or not be in it (and thus I perform a recursive call with sum = sum).
I am thinking about maybe adding another parameter for knowing whether or not the previous index is part of the subset, but I don't know what to do next.
You are on the right track.
Think of it this way:
for every entry you have to decide: do you want to start a new sum at this point or skip it and reconsider the next entry.
a + b + c + d contains the sum of b + c + d. Do you want to recompute the sums?
Maybe a bottom-up approach would be better
The O(n) solution that you asked for:
This solution requires three fixed point numbers: The start and end indices, and the total sum of the span
Starting from element 0 (or from the end of the list if you want) increase the end index until the total sum is greater than or equal to the desired value. If it is equal, you've found a subset sum. If it is greater, move the start index up one and subtract the value of the previous start index. Finally, if the resulting total is greater than the desired value, move the end index back until the sum is less than the desired value. In the other case (where the sum is less) move the end index forward until the sum is greater than the desired value. If no match is found, repeat
So, caveats:
Is this "fairly obvious"? Maybe, maybe not. I was making assumptions about order of magnitude similarity when I said both "fairly obvious" and o(n) in my comments
Is this actually O(n)? It depends a lot on how similar (in terms of order of magnitude (digits in the number)) the numbers in the list are. The closer all the numbers are to each other, the fewer steps you'll need to make on the end index to test if a subset exists. On the other hand, if you have a couple of very big numbers (like in the thousands) surrounded by hundreds of pretty small numbers (1's and 2's and 3's) the solution I've presented will get closer to O(n^2)
This solution only works based on your restriction that the subset values are continuous
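A common way to write the start/end index idea is a sliding window. The sketch below is a variant that only ever moves both indices forward, which is valid under the extra assumption that all values are non-negative (that assumption is mine, not stated in the question). The names `ContiguousSubsetSum` and `find` are hypothetical.

```java
final class ContiguousSubsetSum {
    // Returns {start, end} (both inclusive) of a contiguous run whose sum is
    // target, or null if there is none.
    // Assumption: every element of values is non-negative.
    static int[] find(int[] values, long target) {
        long sum = 0;
        int start = 0;
        for (int end = 0; end < values.length; end++) {
            sum += values[end];
            // Shrink from the left while the window is too large,
            // but never shrink it to an empty window.
            while (sum > target && start < end) {
                sum -= values[start];
                start++;
            }
            if (sum == target) {
                return new int[] {start, end};
            }
        }
        return null;
    }
}
```

Each index moves forward at most n times, so under the non-negativity assumption the loop does O(n) work in total. With negative values the window argument breaks down and this sketch can miss solutions.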
this is a copy of my post on mathexchange.com.
Let E(n) be the set of all possible ending arrangements of a race of n competitors.
Obviously, because it's a race, each one of the n competitors wants to win.
Hence, the order of the arrangements does matter.
Let us also say that if two competitors end with the same result of time, they win the same spot.
For example, E(3) contains the following sets of arrangements:
{(1,1,1), (1,1,2), (1,2,1), (1,2,2), (1,2,3), (1,3,2), (2,1,1), (2,1,2),(2,1,3), (2,2,1), (2,3,1), (3,1,2), (3,2,1)}.
Needless to say, for example, that the arrangement (1,3,3) is invalid, because the two competitors that supposedly ended in the third place, actually ended in the second place. So the above arrangement "transfers" to (1,2,2).
Define k to be the number of distinct positions of the competitors in a subset of E(n).
We have for example:
(1,1,1) -------> k = 1
(1,2,1) -------> k = 2
(1,2,3,2) -------> k = 3
(1,2,1,5,4,4,3) -------> k = 5
Finally, let M(n,k) be the number of subsets of E(n) in which the competitors ended in exactly k distinct positions.
We get, for example,M(3,3) = M(3,2) = 6 and M(3,1) = 1.
-------------------------------------------------------------------------------------------
Thus far is the question
It's a problem I came up with solely by myself. After some time of thought I came up with the following recursive formula for |E(n)|:
(Don't continue reading if you want to derive a formula yourself!)
|E(n)| = sum from l=1 to n of C(n,l)*|E(n-l)| where |E(0)| = 1
And the code in Java for this function, using the BigInteger class:
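// Ens[] caches results; Ens[0] is assumed to be preset to BigInteger.ONE (|E(0)| = 1)
// and every other entry to BigInteger.ZERO, meaning "not computed yet".
public static BigInteger E (int n)
{
    if (!Ens[n].equals(BigInteger.ZERO))
        return Ens[n];
    else
    {
        BigInteger ends=BigInteger.ZERO;
        // l = number of competitors sharing first place
        for (int l=1;l<=n;l++)
            ends=ends.add(factorials[n].divide(factorials[l].multiply(factorials[n-l])).multiply(E(n-l)));
        Ens[n]=ends;
        return ends;
    }
}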
The factorials array is an array of precalculated factorials for faster binomial coefficients calculations.
The Ens array is an array of the memoized/cached E(n) values which really quickens the calculating, due to the need of repeatedly calculating certain E(n) values.
The logic behind this recurrence relation is that l symbolizes how many "first" spots we have. For each l, the binomial coefficient C(n,l) symbolizes in how many ways we can pick l first-placers out of the n competitors. Once we have chosen them, we need to figure out in how many ways we can arrange the n-l competitors we have left, which is just |E(n-l)|.
I get the following:
|E(3)| = 13
|E(5)| = 541
|E(10)| = 102247563
|E(100)| mod 1 000 000 007 = 619182829 -------> 20 ms.
And |E(1000)| mod 1 000 000 007 = 581423957 -------> 39 sec.
I figured out that |E(n)| can also be visualized as the number of sets to which the following applies:
For every i = 1, 2, 3 ... n, every i-tuple subset of the original set has GCD (greatest common divisor) of all of its elements equal to 1.
But I'm not 100% sure about this because I was not able to compute this approach for large n.
However, even with precalculating factorials and memoizing the E(n)'s, the calculating times for higher n's grow very fast.
Is anyone capable of verifying the above formula and values?
Can anyone derive a better, faster formula? Perhaps with generating functions?
As for M(n,k).. I'm totally clueless. I absolutely have no idea how to calculate it, and therefore I couldn't post any meaningful data points.
Perhaps it's P(n,k) = n!/(n-k)!.
Can anyone figure out a formula for M(n,k)?
I have no idea which function is harder to compute, either E(n) or M(n,k), but helping me with either of them will be very much appreciable.
I want the solutions to be generic as well as work efficiently even for large n's. Exhaustive search is not what I'm looking for, unfortunately.
What I am looking for is solutions based purely on combinatorial approach and efficient formulas.
I hope I was clear enough with the wording and what I ask for throughout my post. By the way, I can program using Java. I also know Mathematica pretty decently :) .
Thanks a lot in advance,
Matan.
E(n) are the Fubini numbers. M(n, k) = S(n, k) * k!, where S(n, k) is a Stirling number of the second kind, because S(n, k) is the number of different placing partitions, and k! is the number of ways to rank them.
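As a rough sketch of how this could be computed, the Java below builds one row of Stirling numbers of the second kind with the standard recurrence S(n, k) = k * S(n-1, k) + S(n-1, k-1), multiplies by k! to get M(n, k), and sums the row to get |E(n)|. The class and method names (`RaceOutcomes`, `outcomesByPlaces`, `total`) are made up for illustration.

```java
import java.math.BigInteger;
import java.util.Arrays;

final class RaceOutcomes {
    // Returns an array m where m[k] = M(n, k) = S(n, k) * k! for k = 0..n.
    static BigInteger[] outcomesByPlaces(int n) {
        BigInteger[] s = new BigInteger[n + 1];
        Arrays.fill(s, BigInteger.ZERO);
        s[0] = BigInteger.ONE; // S(0, 0) = 1
        for (int i = 1; i <= n; i++) {
            // Update in place, right to left, so s[k - 1] still holds row i - 1.
            for (int k = i; k >= 1; k--) {
                s[k] = s[k].multiply(BigInteger.valueOf(k)).add(s[k - 1]);
            }
            s[0] = BigInteger.ZERO; // S(i, 0) = 0 for i >= 1
        }
        BigInteger factorial = BigInteger.ONE;
        for (int k = 1; k <= n; k++) {
            factorial = factorial.multiply(BigInteger.valueOf(k));
            s[k] = s[k].multiply(factorial);
        }
        return s;
    }

    // |E(n)|, the n-th Fubini number, as the sum of M(n, k) over k.
    static BigInteger total(int n) {
        BigInteger sum = BigInteger.ZERO;
        for (BigInteger m : outcomesByPlaces(n)) {
            sum = sum.add(m);
        }
        return sum;
    }
}
```

This reproduces M(3,1) = 1, M(3,2) = M(3,3) = 6 and |E(3)| = 13, |E(5)| = 541, |E(10)| = 102247563 from the question, using O(n^2) big-integer operations per row.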
I've been struggling with a Subset algorithm question recently.
How to get all subsets from a char string?
Condition: each subset cannot cover all the distinct letters of the original char string.
For instance, abbc, [a,b,c] -> output-> a, b, c, ab, abb, bbc, bb, bc
Subset: {abc}, and {abbc} should be removed!
My initial thought is to preprocess the original string into a1b2c1, then go recursively, each recursive layer process one distinct letter. In the last layer, like here, we need process c, whether we should put c in the subset depends on the information passed down by previous layers.
I am not sure my idea is good. Does anyone have ideas about this question?
If you need to cover only letters (i.e. the number of distinct objects is at most 26), then you can make a bit set that represents the "universe". This bit set would have 1 in a position of a letter that is in your alphabet, and zero for all other positions.
You can go recursively the way that you described, passing down the universe bit set, along with the soFar bit set, which represents the letters that have been added so far. When you reach an invocation where soFar is equal to the universe, you know that your bit set would have all available letters, and not add it to the list of results.
In the worst case scenario, you will have to process all 2^n subsets, therefore I don't think that you can do better time wise.
Your idea for an answer is pretty good that way.
The only piece of information that needs to be passed to the next recursive step is whether you've chosen all distinct characters until now or not. Which can be done with only one bit. Here's an example:
Denote value and count pairs as v[i] and c[i]. Let n denote the number of such pairs.
The state of the recursive function can be defined as
F(i, b) = Return all subsets including values > v[i]. b = 1, if these subsets should not contain all distinct members.
F(n, b) is a trivial solution to the above recursion.
You seek F(0, 1)
Naturally you'll have to iterate from j = 0 to c[i] to consider all possible cases, ie, how many v[i] the subset shall include.
For j = 0 or b = 0, next recursive set will have b=0
Otherwise it should be 1.
This should generate all subsets that you seek.
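Here is a minimal Java sketch of the count-based recursion with the single flag. It treats subsets as multisets of letters (so for abbc it also produces ac, which the question's example list does not show) and skips the empty subset; both are my reading of the question, not something it states. The flag `usedAll` is the inverse of the bit b above. The names `PartialLetterSubsets`, `subsets` and `build` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class PartialLetterSubsets {
    // All non-empty multisets of letters from s that miss at least one
    // distinct letter of s, written with letters in sorted order.
    static List<String> subsets(String s) {
        TreeMap<Character, Integer> counts = new TreeMap<>();
        for (char ch : s.toCharArray()) {
            counts.merge(ch, 1, Integer::sum);
        }
        List<Map.Entry<Character, Integer>> letters = new ArrayList<>(counts.entrySet());
        List<String> out = new ArrayList<>();
        build(letters, 0, true, new StringBuilder(), out);
        return out;
    }

    // usedAll is true while every letter before index i was taken at least once.
    private static void build(List<Map.Entry<Character, Integer>> letters, int i,
                              boolean usedAll, StringBuilder current, List<String> out) {
        if (i == letters.size()) {
            if (!usedAll && current.length() > 0) {
                out.add(current.toString());
            }
            return;
        }
        char ch = letters.get(i).getKey();
        int max = letters.get(i).getValue();
        int length = current.length();
        // j = how many copies of this letter go into the subset.
        for (int j = 0; j <= max; j++) {
            build(letters, i + 1, usedAll && j > 0, current, out);
            current.append(ch);
        }
        current.setLength(length); // undo the appended copies
    }
}
```

For abbc this yields 9 strings: c, b, bc, bb, bbc, a, ac, ab, abb. The strings abc and abbc are dropped because they use every distinct letter.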
Could some one guide me on how to solve this problem.
We are given a set S with k number of elements in it.
Now we have to divide the set S into x subsets such that the difference in number of elements in each subset is not more than 1 and the sum of each subset should be as close to each other as possible.
Example 1:
{10, 20, 90, 200, 100} has to be divided into 2 subsets
Solution:{10,200}{20,90,100}
sum is 210 and 210
Example 2:
{1, 1, 2, 1, 1, 1, 1, 1, 1, 6}
Solution:{1,1,1,1,6}{1,2,1,1,1}
Sum is 10 and 6.
I see a possible solution in two stages.
Stage One
Start by selecting the number of subsets, N.
Sort the original set, S, if possible.
Distribute the largest N numbers from S into subsets 1 to N in order.
Distribute the next N largest numbers from S the subsets in reverse order, N to 1.
Repeat until all numbers are distributed.
If you can't sort S, then distribute each number from S into the subset (or one of the subsets) with the least entries and the smallest total.
You should now have N subsets all sized within 1 of each other and with very roughly similar totals.
Stage Two
Now try to refine the approximate solution you have.
Pick the largest total subset, L, and the smallest total subset, M. Pick a number in L that is smaller than a number in M but not by so much as to increase the absolute difference between the two subsets. Swap the two numbers. Repeat. Not all pairs of subsets will have swappable numbers. Each swap keeps the subsets the same size.
If you have a lot of time you can do a thorough search; if not then just try to pick off a few obvious cases. I would say don't swap numbers if there is no decrease in difference; otherwise you might get an infinite loop.
You could interleave the stages once there are at least two members in some subsets.
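Stage One for a sortable input can be sketched like this in Java. It covers only the back-and-forth distribution, not the swap refinement of Stage Two, and the names `SnakeDistribution` and `distribute` are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class SnakeDistribution {
    // Deals the values, largest first, into n subsets: subsets 0..n-1 in one
    // round, then n-1..0 in the next round, and so on.
    static List<List<Integer>> distribute(int[] values, int n) {
        int[] sorted = values.clone();
        Arrays.sort(sorted); // ascending; read from the back for descending order
        List<List<Integer>> subsets = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            subsets.add(new ArrayList<>());
        }
        for (int i = 0; i < sorted.length; i++) {
            int value = sorted[sorted.length - 1 - i];
            int round = i / n;
            int position = i % n;
            // Even rounds go left to right, odd rounds right to left.
            int target = (round % 2 == 0) ? position : n - 1 - position;
            subsets.get(target).add(value);
        }
        return subsets;
    }
}
```

On Example 1 with two subsets this gives {200, 20, 10} and {100, 90}, with sums 230 and 190. That is only the rough starting point; the swaps of Stage Two would still be needed to approach the 210/210 split.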
There is no easy algorithm for this problem.
Check out the partition problem, also known as the easiest hard problem, which is this problem for 2 sets. This problem is NP-complete, and you should be able to find algorithms for it on the web.
I know your problem is a bit different since you can choose the number of sets, but you can take inspiration from solutions to that one.
For example :
You can transform this into a series of linear programs; let k be the number of elements in your set.
{a1 ... ak} is your set
For i = 2 to k:
try to solve the following program:
xjl = 1 if element j of set is in set number l (l <= i) and 0 otherwise
minimise max(Abs(sum(apxpn) -sum(apxpm)) for all m,n) // you minimise the max of the difference between 2 sets.
s.t
sum(xpn) on n = 1
(sum(xkn) on k)-(sum(xkm) on k) <= 1 for all m n // the number of element in 2 list are different at most of one element.
xpn in {0,1}
if you find a min less than one then stop
otherwise continue
end for
Hope my notations are clear.
The complexity of this program is exponential, and if you found a polynomial way to solve this you would prove P=NP, so I don't think you can do better.
EDIT
I saw your comment; I misunderstood the constraint on the size of the subsets (I thought it was the difference between 2 sets).
I don't have time to update the answer right now; I will do it when I have time.
Sorry.
EDIT 2
I edited the linear program and it should do what it's asked to do. I just added a constraint.
Hope this time the problem is fully understood, even though this solution might not be optimal
I'm no scientist, so I'd try two approaches:
After sorting items, then going from both "ends" and moving first and last to the actual set,then shift to next set, loop;
Then:
Checking the differences of sums of the sets, and shuffling items if it would help.
Coding the resulting sets appropriately and trying genetic algorithms.
You are given a list of n numbers L=<a_1, a_2,...a_n>. Each of them is
either 0 or of the form +/- 2^k, 0 <= k <= 30. Describe and implement an
algorithm that returns the largest product of a CONTINUOUS SUBLIST
p = a_i * a_(i+1) * ... * a_j, 1 <= i <= j <= n.
For example, for the input <8 0 -4 -2 0 1> it should return 8 (either 8
or (-4)*(-2)).
You can use any standard programming language and can assume that
the list is given in any standard data structure, e.g. int[],
vector<int>, List<Integer>, etc.
What is the computational complexity of your algorithm?
In my first answer I addressed the OP's problem in "multiplying two big big numbers". As it turns out, this wish is only a small part of a much bigger problem which I'm going to address now:
"I still haven't arrived at the final skeleton of my algorithm I wonder if you could help me with this."
(See the question for the problem description)
All I'm going to do is explain the approach Amnon proposed in little more detail, so all the credit should go to him.
You have to find the largest product of a continuous sublist from a list of integers which are powers of 2. The idea is to:
Compute the product of every continuous sublist.
Return the biggest of all these products.
You can represent a sublist by its start and end index. For start=0 there are n possible values for end, namely 0..n-1. This generates all sublists that start at index 0. In the next iteration, You increment start by 1 and repeat the process (this time, there are n-1 possible values for end). This way You generate all possible sublists.
Now, for each of these sublists, You have to compute the product of its elements - that is come up with a method computeProduct(List wholeList, int startIndex, int endIndex). You can either use the built in BigInteger class (which should be able to handle the input provided by Your assignment) to save You from further trouble or try to implement a more efficient way of multiplication as described by others. (I would start with the simpler approach since it's easier to see if Your algorithm works correctly and first then try to optimize it.)
Now that You're able to iterate over all sublists and compute the product of their elements, determining the sublist with the maximum product should be the easiest part.
If it's still too hard for You to make the connections between two steps, let us know - but please also provide us with a draft of Your code as You work on the problem so that we don't end up incrementally constructing the solution and You copy&pasting it.
edit: Algorithm skeleton
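import java.math.BigInteger;

public BigInteger listingSublist(BigInteger[] biArray)
{
    int start = 0;
    int end = biArray.length - 1;
    BigInteger maximum = null; // stays null only for an empty array
    for (int i = start; i <= end; i++)
    {
        for (int j = i; j <= end; j++)
        {
            // keep the largest product seen so far
            BigInteger product = computeProduct(biArray, i, j);
            if (maximum == null || product.compareTo(maximum) > 0)
            {
                maximum = product;
            }
        }
    }
    return maximum;
}

// Multiplies wholeList[startIndex] * ... * wholeList[endIndex], both inclusive.
public BigInteger computeProduct(BigInteger[] wholeList, int startIndex,
                                 int endIndex)
{
    BigInteger product = BigInteger.ONE;
    for (int k = startIndex; k <= endIndex; k++)
    {
        product = product.multiply(wholeList[k]);
    }
    return product;
}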
Since k <= 30, any integer i = 2^k will fit into a Java int. However, the product of two such integers might not fit into a Java int, since 2^k * 2^k = 2^(2k) <= 2^60, which does fit into a Java long. This should answer Your question regarding the "(multiplication of) two numbers...".
In case that You might want to multiply more than two numbers, which is implied by Your assignment saying "...largest product of a CONTINUOUS SUBLIST..." (a sublist's length could be > 2), have a look at Java's BigInteger class.
Actually, the most efficient way of multiplication is doing addition instead. In this special case all you have is numbers that are powers of two, and you can get the product of a sublist by simply adding the exponents together (and counting the negative numbers in your product, and making it a negative number in case of odd negatives).
Of course, to store the result you may need the BigInteger, if you run out of bits. Or depending on how the output should look like, just say (+/-)2^N, where N is the sum of the exponents.
Parsing the input could be a matter of switch-case, since you only have 30 numbers to take care of. Plus the negatives.
That's the boring part. The interesting part is how you get the sublist that produces the largest number. You can take the dumb approach, by checking every single variation, but that would be an O(N^2) algorithm in the worst case (IIRC). Which is really not very good for longer inputs.
What can you do? I'd probably start from the largest non-negative number in the list as a sublist, and grow the sublist to get as many non-negative numbers in each direction as I can. Then, having all the positives in reach, proceed with pairs of negatives on both sides, eg. only grow if you can grow on both sides of the list. If you cannot grow in both directions, try one direction with two (four, six, etc. so even) consecutive negative numbers. If you cannot grow even in this way, stop.
Well, I don't know if this algorithm even works, but if it (or something similar) does, it's an O(N) algorithm, which means great performance. Let's try it out! :-)
Hmmm.. since they're all powers of 2, you can just add the exponent instead of multiplying the numbers (equivalent to taking the logarithm of the product). For example, 2^3 * 2^7 is 2^(7+3)=2^10.
I'll leave handling the sign as an exercise to the reader.
Regarding the sublist problem, there are less than n^2 pairs of (begin,end) indices. You can check them all, or try a dynamic programming solution.
EDIT: I adjusted the algorithm outline to match the actual pseudo code and put the complexity analysis directly into the answer:
Outline of algorithm
Go sequentially over the sequence and store value and first/last index of the product (positive) since the last 0. Do the same for another product (negative) which only consists of the numbers since the first sign change of the sequence. If you hit a negative sequence element swap the two products (positive and negative) along with the associated starting indices. Whenever the positive product hits a new maximum store it and the associated start and end indices. After going over the whole sequence the result is stored in the maximum variables.
To avoid overflow calculate in binary logarithms and an additional sign.
Pseudo code
maxProduct = 0
maxProductStartIndex = -1
maxProductEndIndex = -1
sequence.push_front( 0 ) // reuses variable initialization of the case n == 0
for every index of sequence
    n = sequence[index]
    if n == 0
        posProduct = 0
        negProduct = 0
        posProductStartIndex = index+1
        negProductStartIndex = -1
    else
        if n < 0
            swap( posProduct, negProduct )
            swap( posProductStartIndex, negProductStartIndex )
            if -1 == posProductStartIndex // start second sequence on sign change
                posProductStartIndex = index
            end if
            n = -n;
        end if
        logN = log2(n) // as indicated all arithmetic is done on the logarithms
        posProduct += logN
        if -1 < negProductStartIndex // start the second product as soon as the sign changes first
            negProduct += logN
        end if
        if maxProduct < posProduct // update current best solution
            maxProduct = posProduct
            maxProductStartIndex = posProductStartIndex
            maxProductEndIndex = index
        end if
    end if
end for
// output solution
print "The maximum product is " 2^maxProduct "."
print "It is reached by multiplying the numbers from sequence index "
print maxProductStartIndex " to sequence index " maxProductEndIndex
Complexity
The algorithm uses a single loop over the sequence so it's O(n) times the complexity of the loop body. The most complicated operation of the body is log2. Ergo it's O(n) times the complexity of log2. The log2 of a number of bounded size is O(1) so the resulting complexity is O(n) aka linear.
I'd like to combine Amnon's observation about multiplying powers of 2 with one of mine concerning sublists.
Lists are terminated hard by 0's. We can break the problem down into finding the biggest product in each sub-list, and then the maximum of that. (Others have mentioned this).
This is my 3rd revision of this writeup. But 3's the charm...
Approach
Given a list of non-0 numbers, (this is what took a lot of thinking) there are 3 sub-cases:
The list contains an even number of negative numbers (possibly 0). This is the trivial case, the optimum result is the product of all numbers, guaranteed to be positive.
The list contains an odd number of negative numbers, so the product of all numbers would be negative. To change the sign, it becomes necessary to sacrifice a subsequence containing a negative number. Two sub-cases:
a. sacrifice numbers from the left up to and including the leftmost negative; or
b. sacrifice numbers from the right up to and including the rightmost negative.
In either case, return the product of the remaining numbers. Having sacrificed exactly one negative number, the result is certain to be positive. Pick the winner of (a) and (b).
Implementation
The input needs to be split into subsequences delimited by 0. The list can be processed in place if a driver method is built to loop through it and pick out the beginnings and ends of non-0 sequences.
Doing the math in longs would only double the possible range. Converting to log2 makes arithmetic with large products easier. It prevents program failure on large sequences of large numbers. It would alternatively be possible to do all math in Bignums, but that would probably perform poorly.
Finally, the end result, still a log2 number, needs to be converted into printable form. Bignum comes in handy there. There's new BigInteger("2").pow(log); which will raise 2 to the power of log.
Complexity
This algorithm works sequentially through the sub-lists, only processing each one once. Within each sub-list, there's the annoying work of converting the input to log2 and the result back, but the effort is linear in the size of the list. In the worst case, the sum of much of the list is computed twice, but that's also linear complexity.
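A possible Java sketch of this approach follows: split at zeros, add exponents instead of multiplying, and for an odd number of negatives compare sacrifices (a) and (b). It assumes every non-zero input is +/- 2^k with 0 <= k <= 30, as in the assignment. The handling of a lone negative number (returning it as is, since nothing remains after the sacrifice) is my own addition, and the names `MaxPowerOfTwoProduct`, `maxProduct`, `bestInSegment` and `max` are hypothetical.

```java
import java.math.BigInteger;

final class MaxPowerOfTwoProduct {
    // Largest product of a non-empty contiguous sublist, or null for an empty array.
    // Assumption: every non-zero element is +/- 2^k with 0 <= k <= 30.
    static BigInteger maxProduct(int[] a) {
        BigInteger best = null;
        int i = 0;
        while (i < a.length) {
            if (a[i] == 0) {
                best = max(best, BigInteger.ZERO); // the sublist <0> itself
                i++;
                continue;
            }
            int start = i;
            while (i < a.length && a[i] != 0) i++;
            best = max(best, bestInSegment(a, start, i));
        }
        return best;
    }

    // Best product inside the zero-free segment a[start..end-1].
    static BigInteger bestInSegment(int[] a, int start, int end) {
        int total = 0, negatives = 0, firstNeg = -1, lastNeg = -1;
        int expThroughFirstNeg = 0; // exponent sum of a[start..firstNeg]
        int expBeforeLastNeg = 0;   // exponent sum of a[start..lastNeg-1]
        for (int i = start; i < end; i++) {
            int e = Integer.numberOfTrailingZeros(Math.abs(a[i])); // log2 of |a[i]|
            if (a[i] < 0) {
                negatives++;
                if (firstNeg < 0) { firstNeg = i; expThroughFirstNeg = total + e; }
                lastNeg = i;
                expBeforeLastNeg = total;
            }
            total += e;
        }
        if (negatives % 2 == 0) return BigInteger.ONE.shiftLeft(total);
        if (end - start == 1) return BigInteger.valueOf(a[start]); // lone negative
        BigInteger best = null;
        if (firstNeg < end - 1) // (a) drop everything up to the leftmost negative
            best = max(best, BigInteger.ONE.shiftLeft(total - expThroughFirstNeg));
        if (lastNeg > start)    // (b) drop everything from the rightmost negative on
            best = max(best, BigInteger.ONE.shiftLeft(expBeforeLastNeg));
        return best;
    }

    static BigInteger max(BigInteger x, BigInteger y) {
        if (x == null) return y;
        if (y == null) return x;
        return x.max(y);
    }
}
```

For the input <8 0 -4 -2 0 1> this returns 8. The exponent sums are kept in an int, which is fine as long as 30 * n fits into an int; only the final result is turned into a BigInteger.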
See this code. Here I implement exact factorial of a huge large number. I am just using integer array to make big numbers. Download the code from Planet Source Code.