Comparing numbers or using prime numbers? - java

I'm writing a program that generates bingo card numbers. The bingo card is composed of 5 columns, with 4 numbers in each column. The first column can only contain numbers 1-8, the second 9-16, and so on (up to 40).
So in the database, what I did is have two tables for this. The first table is for the column sets. Each column has its own unique sets of numbers (70 sets per column, which is the number of combinations of 8 taken 4 at a time). For 5 columns, I will have 350 sets. The second table is the card numbers. This is composed of 5 columns, each corresponding to one of the B, I, N, G, O columns of the card. All in all, there are 1,680,700,000 possible combinations for this table. I did it this way because each card is duplicated for each game; only the control numbers for cards are unique.
I want to track the winning card for every drawn number. I need the tracking to be as fast as possible, because we're talking about millions of cards here. I thought of 2 options for doing this:
First, checking whether each drawn number exists on the cards, narrowing down the card pool with each draw.
Second, associating a unique prime number with each number (1-40) and multiplying them, associating the product with the column set (which I call the prime index). The 5 prime indexes, one per column, are multiplied and the product is associated with each card/combination (which I call the card index). When a number is drawn, the card index is divided by the associated prime, checking whether the drawn number is a factor of the card index. Each consecutive draw reduces the card index (for each card in the pool), so it is reduced to 1 when a winning card exists. I will be using MySQL and Java. Which of these 2 techniques is the faster approach? I do also consider memory space, load, etc., but the speed of the tracking matters most to me. Thanks a lot!
P.S. Sorry for the long explanation. I just want to clarify things. :D
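For concreteness, here is a minimal Java sketch of the second (prime-index) technique described above. The class and method names are illustrative only; note that the product of 20 primes overflows a long, so BigInteger is needed.

import java.math.BigInteger;

public class PrimeIndexSketch {
    // One prime per bingo number 1..40 (index 0 unused).
    static final int[] PRIMES = {0, 2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37,
            41, 43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101, 103, 107,
            109, 113, 127, 131, 137, 139, 149, 151, 157, 163, 167, 173};

    // Card index = product of the primes of the 20 numbers on the card.
    static BigInteger cardIndex(int[] cardNumbers) {
        BigInteger idx = BigInteger.ONE;
        for (int n : cardNumbers) idx = idx.multiply(BigInteger.valueOf(PRIMES[n]));
        return idx;
    }

    // On each draw, divide out the prime if it is a factor;
    // a card has won once its index has been reduced to 1.
    static BigInteger applyDraw(BigInteger cardIdx, int drawnNumber) {
        BigInteger p = BigInteger.valueOf(PRIMES[drawnNumber]);
        BigInteger[] qr = cardIdx.divideAndRemainder(p);
        return qr[1].signum() == 0 ? qr[0] : cardIdx;
    }
}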

If you want to be really fast, just keep your 24 million cards in memory while they are needed and do a simple comparison. Using the database for this is overkill and just makes everything more difficult. RAM is not expensive anymore.

There are exactly 70^5 = 1,680,700,000 possible cards. There is no need to store the cards themselves. You can calculate the numbers on a card directly from its index alone. The other way around, finding the indices of cards given the drawn numbers, is just a little bit harder.
For example, card #1421934546. Writing this in base 70 gives: 59 15 40 50 46 (that is, 46 + 70*50 + 70^2*40 + 70^3*15 + 70^4*59 = 1421934546). So the first column is the 46th (actually the 47th, because of the off-by-one) of the 70 possible sets.
Given the drawn numbers, you can quickly find the column sets that match. For example, with numbers 1, 2 and 3, there are 5 sets in the first column that match: 1234, 1235, 1236, 1237 and 1238. So all matching card indices, taken % 70, equal one of those 5 set ids. If you find all the possible sets for each column, the Cartesian product gives all matching cards.
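As a small illustration of the base-70 decoding, assuming (as in the example above) that the least significant digit corresponds to the first column; the method name is mine:

static int[] columnSetIds(long cardIndex) {        // cardIndex in [0, 70^5)
    int[] setIds = new int[5];                     // one set id (0..69) per column
    for (int col = 0; col < 5; col++) {
        setIds[col] = (int) (cardIndex % 70);      // col 0 = first (B) column
        cardIndex /= 70;
    }
    return setIds;
}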

You don't have to arrange the data in memory the same way it appears on the card. E.g. if you have N squares which can each be either selected or not selected, a BitSet may be a good choice. This uses 1 bit per square (with some overhead).
Say you have up to 64 squares; that is one long value (64 bits). If you have 1 million cards, this will take up 8 MB of memory. Once you determine which card(s) are winners, you can determine who the owners are. (This could be stored in a database.)
Say you sell a card to every adult in the US (AFAIK, no lottery has ever been that popular). At, say, one dollar each, you would be bringing in 200 million dollars. You would need 1.6 GB of memory, which would easily fit into a 4 GB server costing about $500. You could buy a 16 GB server for about $1000 just to be sure. ;)
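A rough sketch of the one-long-per-card idea, under the assumption that each card is represented by a 64-bit mask of the numbers (1-40) it contains; a BitSet would be needed beyond 64 squares:

static long cardMask(int[] cardNumbers) {
    long mask = 0L;
    for (int n : cardNumbers) mask |= 1L << n;     // bit n set if number n is on the card
    return mask;
}

static int countWinners(long[] cardMasks, long drawnMask) {
    int winners = 0;
    for (long card : cardMasks) {
        if ((card & ~drawnMask) == 0) winners++;   // every number on the card has been drawn
    }
    return winners;
}

With one long per card, a million cards is 8 MB and the scan over them is a tight loop of bitwise operations.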

Related

What are some ways to store and recover numbers in this situation?

So I'm going to be running a simulator that plays craps.
My assignment requires me to run the sim 10,000,000 times.
None of that is an issue; I have the sim made, I know how to run it, and I know how to create the required variables.
What I'm unsure of is how I should go about storing the results of each game.
What I need to find in the end is:
Average # of Rolls Per Game
Max # of Rolls in a game
number of games that needed more than 30 rolls
number of wins
number of losses
probability of a win
longest sequence of wins and longest sequence of losses
All easy enough, I'm just not sure how to store 10,000,000 numbers and then access them easily.
For example the first:
Average number of rolls
Should I create an ArrayList that has 10,000,000 items in it? Add one item at the end of each game and then add them all up and divide by 10,000,000?
I realize this should work, I'm just wondering if there is another way, or perhaps a better (more efficient) way.
New part to this question:
Can I return more than one value from a method? Currently the simulation runs 10,000,000 times and returns a win or loss from each time. But I also need it to return the number of rolls from each game... Otherwise I can't figure out the values for avg rolls and highest number of rolls and number of games over 30 rolls.
Any ideas here?
You don't need to maintain an array for any of the statistics you want.
For the average number of rolls per game, just keep a variable, say cumulativeNumberOfRolls; after every game, take the number of rolls in that game and add it to this variable. When all simulations are done, just divide this value by the total number of simulations (10,000,000).
For the max number of rolls, again keep a single variable, say maxRolls; after every game, compare the number of rolls in that game with this variable. If the number of rolls in this game is greater, just update maxRolls with the new value. Use the same approach - a single variable updated after every game - to get the number of games that required more than 30 rolls, the number of wins and the number of losses. If you face problems, we can discuss them in the comments.
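A minimal sketch of those accumulators in Java; the GameResult type and playOneGame() are hypothetical stand-ins for the asker's own simulation code:

long cumulativeNumberOfRolls = 0;
int maxRolls = 0, gamesOver30 = 0, wins = 0, losses = 0;

for (int game = 0; game < 10_000_000; game++) {
    GameResult r = playOneGame();                 // hypothetical: returns rolls and win/loss
    cumulativeNumberOfRolls += r.rolls;
    if (r.rolls > maxRolls) maxRolls = r.rolls;
    if (r.rolls > 30) gamesOver30++;
    if (r.won) wins++; else losses++;
}
double avgRolls = cumulativeNumberOfRolls / 10_000_000.0;
double winProbability = wins / 10_000_000.0;

Returning a small result object like GameResult (a class with a rolls field and a won field) also answers the "can I return more than one value from a method" part of the question.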
For longest sequence of wins and losses, you would need to maintain a bunch of variables:
longest win sequence overall
longest loss sequence overall
current sequence count
current sequence type (indicates if current sequence is a win sequence or loss sequence)
Here's the overview of the approach.
After every game, compare the result of the game with the current sequence type. If they are same, for instance result of current game is win and the current sequence type is also a win, then just update the current sequence count and move on to the next game. If they are different, you need to consider two scenarios and do slightly different things for them. I'll explain for one - the result of current game is loss and the current sequence type is win. In this scenario, compare current sequence count with longest win sequence overall and if it (current sequence count) is greater then just update the longest win sequence overall. After this, change the current sequence type to loss and set the current sequence count to 1.
Extend the above approach for the second scenario - the result of the current game is win and the current sequence type is loss. If you have clarifications, feel free to post back in comments.
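Here is one way the streak bookkeeping could look in Java; this is a sketch following the variables listed above, not the only way to write it:

int longestWinStreak = 0, longestLossStreak = 0;
int currentStreakLength = 0;
boolean currentStreakIsWin = false;

void recordResult(boolean won) {
    if (currentStreakLength == 0 || won == currentStreakIsWin) {
        currentStreakIsWin = won;
        currentStreakLength++;                    // same type: extend the current streak
    } else {
        if (currentStreakIsWin) longestWinStreak = Math.max(longestWinStreak, currentStreakLength);
        else longestLossStreak = Math.max(longestLossStreak, currentStreakLength);
        currentStreakIsWin = won;                 // different type: close old streak, start a new one
        currentStreakLength = 1;
    }
}
// After the last game, compare the final streak against the maxima one more time.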
You could just calculate the statistics as you go without storing them. For instance, if you keep a running total of rolls in your class, then after each simulation, average = ((number of rolls this game) + (total rolls so far)) / (number of games so far, including this one). The same can be done for the other statistics.
Well, you've got a fixed number of runs, so you might as well use an array rather than an ArrayList (faster). It seems to me that you actually only need two arrays: one listing the outcome of each game (maybe true/false for win/lose), and one with the number of rolls in that game. You fill these up as you run the simulations; then you get to do a bunch of simple math involving one array or the other to get your stats. That seems like the best way to go about it to me; I don't think you're going to get much more efficient without a lot of undue effort.

8 puzzle: Solvability and shortest solution

I have built an 8 puzzle solver using Breadth First Search. I would now like to modify the code to use heuristics. I would be grateful if someone could answer the following two questions:
Solvability
How do we decide whether an 8 puzzle is solvable? (given a starting state and a goal state)
This is what Wikipedia says:
The invariant is the parity of the permutation of all 16 squares plus
the parity of the taxicab distance (number of rows plus number of
columns) of the empty square from the lower right corner.
Unfortunately, I couldn't understand what that meant. It was a bit complicated to understand. Can someone explain it in a simpler language?
Shortest Solution
Given a heuristic, is it guaranteed to give the shortest solution using the A* algorithm? To be more specific, will the first node in the open list always have a depth (or number of movements made so far) which is the minimum of the depths of all the nodes present in the open list?
Should the heuristic satisfy some condition for the above statement to be true?
Edit : How is it that an admissible heuristic will always provide the optimal solution? And how do we test whether a heuristic is admissible?
I would be using the heuristics listed here
Manhattan Distance
Linear Conflict
Pattern Database
Misplaced Tiles
Nilsson's Sequence Score
N-MaxSwap X-Y
Tiles out of row and column
For clarification from Eyal Schneider :
I'll refer only to the solvability issue. Some background in permutations is needed.
A permutation is a reordering of an ordered set. For example, 2134 is a reordering of the list 1234, where 1 and 2 swap places. A permutation has a parity property; it refers to the parity of the number of inversions. For example, in the following permutation you can see that exactly 3 inversions exist (23,24,34):
1234
1432
That means that the permutation has an odd parity. The following permutation has an even parity (12, 34):
1234
2143
Naturally, the identity permutation (which keeps the items order) has an even parity.
Any state in the 15 puzzle (or 8 puzzle) can be regarded as a permutation of the final state, if we look at it as a concatenation of the rows, starting from the first row. Note that every legal move changes the parity of the permutation (because we swap two elements, and the number of inversions involving items in between them must be even). Therefore, if you know that the empty square has to travel an even number of steps to reach its final state, then the permutation must also be even. Otherwise, you'll end with an odd permutation of the final state, which is necessarily different from it. Same with odd number of steps for the empty square.
According to the Wikipedia link you provided, the criterion above is both necessary and sufficient for a given puzzle to be solvable.
The A* algorithm is guaranteed to find a shortest solution (one of them, if there are several equally short ones), provided your heuristic always underestimates the real cost (in your case, the real number of moves needed to reach the solution).
But off the top of my head I cannot come up with a good heuristic for your problem; finding one takes some thought.
The real art using A* is to find a heuristic that always underestimates the real costs but as little as possible to speed up the search.
First ideas for such a heuristic:
A quite bad but valid heuristic that popped into my mind is the Manhattan distance of the empty field to its final destination.
The sum of the Manhattan distances of each field to its final destination, divided by the maximal number of fields that can change position within one move. (I think this is quite a good heuristic.)
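For reference, here is a minimal sketch of the standard sum-of-Manhattan-distances heuristic for the 8-puzzle (tiles only, blank excluded, which keeps it admissible); the board layout as an int[9] read row by row is my assumption:

static int manhattan(int[] board, int[] goalIndexOfTile) {   // board[i] = tile at cell i, 0 = blank
    int h = 0;
    for (int i = 0; i < 9; i++) {
        int tile = board[i];
        if (tile == 0) continue;                             // the blank is not counted
        int g = goalIndexOfTile[tile];                       // where this tile belongs in the goal
        h += Math.abs(i / 3 - g / 3) + Math.abs(i % 3 - g % 3);
    }
    return h;
}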
For anyone coming along, I will attempt to explain how the OP got the value pairs, as well as how he determines the highlighted ones, i.e. inversions, as it took me several hours to figure it out. First, the pairs.
First, take the goal state and imagine it as a 1D array (A, for example):
[1,2,3,8,0,4,7,5]. Each value in that array gets its own column in the table (going all the way down), which gives the first value of each pair.
Then move over 1 value to the right in the array (i + 1) and go all the way down again for the second pair value. For example (state A): the first column's second values will start [2,3,8,0,4,7,5] going down; the second column's will start [3,8,0,4,7,5], etc.
Okay, now for the inversions. For each of the 2 pair values, find their INDEX location in the start state. If the left INDEX > right INDEX, then it's an inversion (highlighted). The first four pairs of state A are: (1,2), (1,3), (1,8), (1,0)
1 is at Index 3
2 is at Index 0
3 > 0, so inversion.
1 is at Index 3
3 is at Index 2
3 > 2, so inversion.
1 is at Index 3
8 is at Index 1
3 > 1, so inversion.
1 is at Index 3
0 is at Index 7
3 < 7, so no inversion.
Do this for each pair and tally up the total inversions.
If both are even or both are odd (the Manhattan distance of the blank spot and the total inversions),
then it's solvable. Hope this helps!
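As a Java sketch of the inversion-count test, assuming the standard goal state 1..8 with the blank last (for a different goal, such as the one used in the example above, count inversions of the start state relative to that goal's ordering instead):

static boolean isSolvable(int[] board) {                 // 3x3 board read row by row, 0 = blank
    int inversions = 0;
    for (int i = 0; i < board.length; i++) {
        for (int j = i + 1; j < board.length; j++) {
            if (board[i] != 0 && board[j] != 0 && board[i] > board[j]) inversions++;
        }
    }
    return inversions % 2 == 0;                          // for odd board width the blank's row drops out
}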

External shuffle: shuffling large amount of data out of memory

I am looking for a way to shuffle a large amount of data which does not fit into memory (approx. 40GB).
I have around 30 million entries, of variable length, stored in one large file. I know the starting and ending positions of each entry in that file. I need to shuffle this data, which does not fit in RAM.
The only solution I thought of is to shuffle an array containing the numbers from 1 to N, where N is the number of entries, with the Fisher-Yates algorithm and then copy the entries in a new file, according to this order. Unfortunately, this solution involves a lot of seek operations, and thus, would be very slow.
Is there a better solution to shuffle large amount of data with uniform distribution?
First get the shuffle issue out of your face. Do this by inventing a hash algorithm for your entries that produces random-like results, then do a normal external sort on the hash.
Now that you have transformed your shuffle into a sort, your problem turns into finding an efficient external sort algorithm that fits your pocket and memory limits. That should now be as easy as a Google search.
A simple approach is to pick a K such that 1/K of the data fits comfortably in memory. Perhaps K=4 for your data, assuming you've got 16GB RAM. I'll assume your random number function has the form rnd(n) which generates a uniform random number from 0 to n-1.
Then:
for i = 0 .. K-1:
    Initialize your random number generator to a known state.
    Read through the input data, generating a random number rnd(K) for each item as you go.
    Retain items in memory whenever rnd(K) == i.
    After you've read the input file, shuffle the retained data in memory.
    Write the shuffled retained items to the output file.
This is very easy to implement, will avoid a lot of seeking, and is clearly correct.
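A hedged Java sketch of this K-pass approach; it simplifies by treating records as lines, whereas a real implementation would use the known byte offsets of the entries:

import java.io.*;
import java.nio.file.*;
import java.util.*;

public class KPassShuffle {
    public static void shuffle(Path in, Path out, int k, long seed) throws IOException {
        try (BufferedWriter w = Files.newBufferedWriter(out)) {
            for (int pass = 0; pass < k; pass++) {
                Random rnd = new Random(seed);                 // same known state on every pass
                List<String> retained = new ArrayList<>();
                try (BufferedReader r = Files.newBufferedReader(in)) {
                    String line;
                    while ((line = r.readLine()) != null) {
                        if (rnd.nextInt(k) == pass) retained.add(line);   // item belongs to this pass
                    }
                }
                Collections.shuffle(retained, new Random(seed + 1 + pass)); // in-memory shuffle
                for (String s : retained) { w.write(s); w.newLine(); }
            }
        }
    }
}

Re-seeding the selection generator with the same seed on every pass is what makes each item land in exactly one pass.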
An alternative is to partition the input data into K files based on the random numbers, and then go through each, shuffling in memory and writing to disk. This reduces disk IO (each item is read twice and written twice, compared to the first approach where each item is read K times and written once), but you need to be careful to buffer the IO to avoid a lot of seeking, it uses more intermediate disk, and is somewhat more difficult to implement. If you've got only 40GB of data (so K is small), then the simple approach of multiple iterations through the input data is probably best.
If you use 20ms as the time for reading or writing 1MB of data (and assuming the in-memory shuffling cost is insignificant), the simple approach will take 40*1024*(K+1)*20ms, which is about 1 hour 8 minutes (assuming K=4). The intermediate-file approach will take 40*1024*4*20ms, which is around 55 minutes, assuming you can minimize seeking. Note that an SSD is approximately 20 times faster for reads and writes (even ignoring seeking), so you should expect to perform this task in well under 10 minutes using an SSD. Numbers from Latency Numbers Every Programmer Should Know.
I suggest keeping your general approach, but inverting the map before doing the actual copy. That way, you read sequentially and do scattered writes rather than the other way round.
A read has to be done when requested before the program can continue. A write can be left in a buffer, increasing the probability of accumulating more than one write to the same disk block before actually doing the write.
Premise
From what I understand, using the Fisher-Yates algorithm and the data you have about the positions of the entries, you should be able to obtain (and compute) a list of:
struct Entry {
    long long sourceStartIndex;
    long long sourceEndIndex;
    long long destinationStartIndex;
    long long destinationEndIndex;
};
Problem
From this point onward, the naive solution is to seek each entry in the source file, read it, then seek to the new position of the entry in the destination file and write it.
The problem with this approach is that it uses way too many seeks.
Solution
A better way to do it is to reduce the number of seeks by using two huge buffers, one for each of the files.
I recommend a small buffer for the source file (say 64MB) and a big one for the destination file (as big as the user can afford - say 2GB).
Initially, the destination buffer will be mapped to the first 2GB of the destination file. At this point, read the whole source file, in chunks of 64MB, in the source buffer. As you read it, copy the proper entries to the destination buffer. When you reach the end of the file, the output buffer should contain all the proper data. Write it to the destination file.
Next, map the output buffer to the next 2GB of the destination file and repeat the procedure. Continue until you have written the whole output file.
Caution
Since the entries have arbitrary sizes, it's very likely that at the beginning and ending of the buffers you will have suffixes and prefixes of entries, so you need to make sure you copy the data properly!
Estimated time costs
The execution time depends, essentially, on the size of the source file, the RAM available to the application and the read speed of the HDD. Assuming a 40GB file, 2GB of RAM and a 200MB/s HDD read speed, the program will need to read 800GB of data (40GB * (40GB / 2GB)). Assuming the HDD is not highly fragmented, the time spent on seeks will be negligible. This means the reads alone will take about an hour! But if, luckily, the user has 8GB of RAM available for your application, the time may decrease to only 15 to 20 minutes.
I hope this will be enough for you, as I don't see any other faster way.
Although you can use external sort on a random key, as proposed by OldCurmudgeon, the random key is not necessary. You can shuffle blocks of data in memory, and then join them with a "random merge," as suggested by aldel.
It's worth specifying what "random merge" means more clearly. Given two shuffled sequences of equal size, a random merge behaves exactly as in merge sort, with the exception that the next item to be added to the merged list is chosen using a boolean value from a shuffled sequence of zeros and ones, with exactly as many zeros as ones. (In merge sort, the choice would be made using a comparison.)
Proving it
My assertion that this works isn't enough. How do we know this process gives a shuffled sequence, such that every ordering is equally possible? It's possible to give a proof sketch with a diagram and a few calculations.
First, definitions. Suppose we have N unique items, where N is an even number, and M = N / 2. The N items are given to us in two M-item sequences labeled 0 and 1 that are guaranteed to be in a random order. The process of merging them produces a sequence of N items, such that each item comes from sequence 0 or sequence 1, and the same number of items come from each sequence. It will look something like this:
0: a b c d
1: w x y z
N: a w x b y c d z
Note that although the items in 0 and 1 appear to be in order, they are just labels here, and the order doesn't mean anything. It just serves to connect the order of 0 and 1 to the order of N.
Since we can tell from the labels which sequence each item came from, we can create a "source" sequence of zeros and ones. Call that c.
c: 0 1 1 0 1 0 0 1
By the definitions above, there will always be exactly as many zeros as ones in c.
Now observe that for any given ordering of labels in N, we can reproduce a c sequence directly, because the labels preserve information about the sequence they came from. And given N and c, we can reproduce the 0 and 1 sequences. So we know there's always one path back from a sequence N to one triple (0, 1, c). In other words, we have a reverse function r defined from the set of all orderings of N labels to triples (0, 1, c) -- r(N) = (0, 1, c).
We also have a forward function f from any triple (0, 1, c) that simply re-merges 0 and 1 according to the values in c. Together, these two functions show that there is a one-to-one correspondence between outputs of r(N) and orderings of N.
But what we really want to prove is that this one-to-one correspondence is exhaustive -- that is, we want to prove that there aren't extra orderings of N that don't correspond to any triple, and that there aren't extra triples that don't correspond to any ordering of N. If we can prove that, then we can choose orderings of N in a uniformly random way by choosing triples (0, 1, c) in a uniformly random way.
We can complete this last part of the proof by counting bins. Suppose every possible triple gets a bin. Then we drop every ordering of N in the bin for the triple that r(N) gives us. If there are exactly as many bins as orderings, then we have an exhaustive one-to-one correspondence.
From combinatorics, we know that the number of orderings of N unique labels is N!. We also know that the number of orderings of 0 and of 1 are both M!. And we know that the number of possible sequences c is N choose M, which is the same as N! / (M! * (N - M)!).
This means there are a total of
M! * M! * N! / (M! * (N - M)!)
triples. But N = 2 * M, so N - M = M, and the above reduces to
M! * M! * N! / (M! * M!)
That's just N!. QED.
Implementation
To pick triples in a uniformly random way, we must pick each element of the triple in a uniformly random way. For 0 and 1, we accomplish that using a straightforward Fisher-Yates shuffle in memory. The only remaining obstacle is generating a proper sequence of zeros and ones.
It's important -- important! -- to generate only sequences with equal numbers of zeros and ones. Otherwise, you haven't chosen from among Choose(N, M) sequences with uniform probability, and your shuffle may be biased. The really obvious way to do this is to shuffle a sequence containing an equal number of zeros and ones... but the whole premise of the question is that we can't fit that many zeros and ones in memory! So we need a way to generate random sequences of zeros and ones that are constrained such that there are exactly as many zeros as ones.
To do this in a way that is probabilistically coherent, we can simulate drawing balls labeled zero or one from an urn, without replacement. Suppose we start with fifty 0 balls and fifty 1 balls. If we keep count of the number of each kind of ball in the urn, we can maintain a running probability of choosing one or the other, so that the final result isn't biased. The (suspiciously Python-like) pseudocode would be something like this:
from random import randrange

def generate_choices(N, M):
    n0 = M
    n1 = N - M
    while n0 + n1 > 0:
        if randrange(0, n0 + n1) < n0:
            yield 0
            n0 -= 1
        else:
            yield 1
            n1 -= 1
Because this uses an integer randrange rather than floating point arithmetic, the choice probabilities are exact (up to the quality of the random number generator).
This last part of the algorithm is crucial. Going through the above proof exhaustively makes it clear that other ways of generating ones and zeros won't give us a proper shuffle.
Performing multiple merges in real data
There remain a few practical issues. The above argument assumes a perfectly balanced merge, and it also assumes you have only twice as much data as you have memory. Neither assumption is likely to hold.
The first turns out not to be a big problem because the above argument doesn't actually require equally sized lists. It's just that if the list sizes are different, the calculations are a little more complex. If you go through the above replacing the M for list 1 with N - M throughout, the details all line up the same way. (The pseudocode is also written in a way that works for any M greater than zero and less than N. There will then be exactly M zeros and N - M ones.)
The second means that in practice, there might be many, many chunks to merge this way. The process inherits several properties of merge sort — in particular, it requires that for K chunks, you'll have to perform roughly K / 2 merges, and then K / 4 merges, and so on, until all the data has been merged. Each batch of merges will loop over the entire dataset, and there will be roughly log2(K) batches, for a run time of O(N * log(K)). An ordinary Fisher-Yates shuffle would be strictly linear in N, and so in theory would be faster for very large K. But until K gets very, very large, the penalty may be much smaller than the disk seeking penalties.
The benefit of this approach, then, comes from smart IO management. And with SSDs it might not even be worth it — the seek penalties might not be large enough to justify the overhead of multiple merges. Paul Hankin's answer has some practical tips for thinking through the practical issues raised.
Merging all data at once
An alternative to doing multiple binary merges would be to merge all the chunks at once -- which is theoretically possible, and might lead to an O(N) algorithm. The random number generation algorithm for values in c would need to generate labels from 0 to K - 1, such that the final outputs have exactly the right number of labels for each category. (In other words, if you're merging three chunks with 10, 12, and 13 items, then the final value of c would need to have 0 ten times, 1 twelve times, and 2 thirteen times.)
I think there is probably an O(N) time, O(1) space algorithm that will do that, and if I can find one or work one out, I'll post it here. The result would be a truly O(N) shuffle, much like the one Paul Hankin describes towards the end of his answer.
Logically partition your database entries (e.g. alphabetically).
Create indexes based on your created partitions.
Build a DAO to sensitize based on the index.

Ascending Integers in TXT to Array

My problem is to get huge text files (UTF-8, 1 byte per character (ANSI)) containing unsigned integers without duplicates, in ascending order, into an array. FAST!
So I was going for something like:
while(scan.hasNextInt()) x.add(scan.nextInt());
But whether I go with an ArrayList, a Vector, or a plain array, with files containing millions of integers it would be wise to determine the maximum capacity needed up front, to avoid growing the array later.
With File.length() I will get the number of digits plus line feeds in the file.
In the worst case it would start at 0 and increment by only 1 on each line.
I think the maximum capacity is somehow calculable using combinatorics, but I am at a dead end. The fact that smaller numbers don't get padded with zeros (002) somehow throws me off.
Taking the size of the first int into consideration, I think one might also be able to approximate the real amount a little more closely.
So my most important question is how to calculate an approximate maximum capacity needed, in O(1).
In addition, I am asking myself whether scan.hasNextInt() and scan.nextInt() are the fastest option for this rather particular problem, and whether parallelization via threads could speed up the process even more (considering the characteristics of reading from a hard drive, probably not).
regards
Halo
Assuming there is only one byte used to separate two numbers (e.g. a '\n'), we have:
10 numbers with 1 digit -> 20 bytes
90 numbers with 2 digits -> 270 bytes
900 numbers with 3 digits -> 3600 bytes
... you get the pattern
If your file size is now 1000 bytes, the most you can have is the 10 one-digit numbers and the 90 two-digit numbers, leaving 710 bytes for 3-digit numbers. 710/4 = 177.5, which makes at most 10+90+177 = 277 numbers.
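A small Java sketch of that bound, greedily assuming as many short numbers as possible (one separator byte per number, as in the answer); the method name is mine:

static long maxPossibleCount(long fileSizeBytes) {
    long remaining = fileSizeBytes;
    long count = 0;
    long numbersOfThisLength = 10;                 // 10 one-digit numbers, then 90, 900, ...
    for (int digits = 1; remaining > 0; digits++) {
        long bytesPerNumber = digits + 1;          // digits plus separator
        long take = Math.min(numbersOfThisLength, remaining / bytesPerNumber);
        count += take;
        remaining -= take * bytesPerNumber;
        if (take < numbersOfThisLength) break;     // ran out of room at this length
        numbersOfThisLength = (digits == 1) ? 90 : numbersOfThisLength * 10;
    }
    return count;
}

For a 1000-byte file this returns 277, matching the hand calculation above.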

Efficiency between lists and methods

I was thinking of making a Sudoku solver, I have 2 questions:
1) What would be faster?
A) Go through all the empty spots; keep a list of candidate numbers (1-9) for each, removing a number if it already appears in the same line or the same 3x3 box, and if the list has length 1, fill in the only remaining number. Repeat this while needed.
B) Go through all the numbers, then check all the spots to see if they can hold that number. Repeat this while needed.
2) What is the most efficient list type for holding a list of at most 9 elements?
Thanks,
Legend
Answer 2) Not a list but a set would make sense - in this case, a BitSet.
Case 1) There are 27 rules (9 rows, 9 columns, 9 boxes) in a 9x9 sudoku.
Case 1A) Every spot participates in 3 rules.
Case 1B) Every number appears 9 times; each occurrence participates in 3 rules.
Answer 1) 1A and 1B should theoretically not differ, but 1A seems to make the algorithm & data structures easier.
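A minimal sketch of the BitSet-of-candidates idea for approach A; the grid layout and helper name are mine:

import java.util.BitSet;

static BitSet candidatesFor(int[][] grid, int row, int col) {   // grid[r][c] in 0..9, 0 = empty
    BitSet candidates = new BitSet(10);
    candidates.set(1, 10);                                      // numbers 1..9 all possible at first
    for (int i = 0; i < 9; i++) {
        candidates.clear(grid[row][i]);                         // same row
        candidates.clear(grid[i][col]);                         // same column
        candidates.clear(grid[3 * (row / 3) + i / 3][3 * (col / 3) + i % 3]);  // same box
    }
    return candidates;                                          // cardinality() == 1 means a forced value
}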
I think B works! You can use a backtracking algorithm to fill each empty spot with one of the numbers 1-9 (in order). Fill the spot with the first available choice (1-9) and move ahead. If at any point you are unable to insert a number into a slot, backtrack to the previous slot and try a different number.
This might be helpful :
http://edwinchan.wordpress.com/2006/01/08/sudoku-solver-in-c-using-backtracking/
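A compact Java sketch of the backtracking approach described above (0 marks an empty cell); this is one standard way to write it, not necessarily what the linked post does:

static boolean solve(int[][] grid) {
    for (int r = 0; r < 9; r++) {
        for (int c = 0; c < 9; c++) {
            if (grid[r][c] != 0) continue;
            for (int n = 1; n <= 9; n++) {
                if (canPlace(grid, r, c, n)) {
                    grid[r][c] = n;
                    if (solve(grid)) return true;
                    grid[r][c] = 0;                // undo and try the next number
                }
            }
            return false;                          // no number fits here: backtrack
        }
    }
    return true;                                   // no empty cell left: solved
}

static boolean canPlace(int[][] g, int r, int c, int n) {
    for (int i = 0; i < 9; i++) {
        if (g[r][i] == n || g[i][c] == n) return false;
        if (g[3 * (r / 3) + i / 3][3 * (c / 3) + i % 3] == n) return false;
    }
    return true;
}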
