frequency analysis algorithm - java

I want to write a java program that searches through a cipher text and returns a frequency count of the characters in the cipher, for example the cipher:
"jshddllpkeldldwgbdpked" will have a result like this:
2 letter occurrences:
pk = 2, ke = 2, ld = 2
3 letter occurrences:
pke = 2.
Any algorithm that allows me to do this as efficiently as possible?

The map strategy is a good one, but I'd go for HashMap<String, Integer>, since it's tuples of characters being counted.
Iterating over the characters in the ciphertext, you can save the last X characters and that will give you a map over all occurrences of substrings of length X+1.

The usual approach would be to use some kind of map to map your characters to their counts. You can use a HashMap<Character, Integer> for example. You can then iterate through your ciphertext, character-wise and either put the character into the map with a count of 1 (if it doesn't yet exist) or increment its count.

You could store the n-grams in a trie, reversing the normal order so the last character in the n-gram is at the top of the trie. Each node in the trie stores a character count. Loop over the string, keeping track of the last N characters (as Buhb suggests). Each time through the outer loop, you traverse the trie, using the last N characters to pick the path, starting with the last character and ending with the Nth to last. For each node you visit, incrementing its counter.
To print the n-gram frequencies, perform a breadth-first traversal of the trie.
Overall performance left as an exercise.

Either have an array with a cell for each possible value (easy if the cipher text is all lower case characters - 26 - harder if not) or go for a Map where you pass in the character and increment the value in either case. The array is quicker but less flexible.

If the set of lengths of sequences you need is fixed, the obvious algorithm takes a linear number of counting operations (say, looking up a counter in a hashtable and incrementing it).
When you say "as efficiently as possible", do you propose to spend a lot of effort for a meagre constant-factor improvement, to search hopelessly for a sublinear algorithm, or do you not understand algorithm complexity classes at all?

You can use hash or graph (Thanks to outis, I know it's special name now, such kind of graphs is called "trie"). Hash will be slower, graph will be faster. Hash will get less memory, graph will take more in bad implementation.
You cannot get it done using array since it will get HUGE amount of memory if your maximum char sequence length is equal to your text length, and text is long enough. If you will limit it it will get smth like ([number of letters]^[max sequence length])*4 bytes which will be (52^4)*4 ~= 24Mb of memory for 4 lower/upper letter sequence. If limited sequence length is ok for you and this memory amount is normal than algorithm will be pretty easy for <=4 letters in sequence.

You could start by looking for the largest possible repeatable sequence first then work your way down from there. For example if the string is 10 characters the largest repeatable sequence that could occur would be 5 letters long so first look for 5 letter sequences then 4 letters and so on till you reach 2. This should reduce the number of iterations in your program.

I dont have an answer in mind for this,
But I feel, this algorithm is the exact same as the algorithm used by compression algorithms to create compressed files with the dictionary approach.
If I am not wrong, in this approach, a dictionary is used in the following manner:
data:
abccccabaccabcaaaaabcaaabbbbbccccaaabcbbbbabbabab
parse 1 :
key: *
value: abc
new data:
*cccabacc*aaaa*aaabbbbbccccaa*bbbbabbabab
Just an educated guess, I think (not sure here) the standard "zip" file uses this approach,
so I suggest you look at these algorithms

Related

How can I get the most frequent 100 numbers out of 4,000,000,000 numbers?

Yesterday in a coding interview I was asked how to get the most frequent 100 numbers out of 4,000,000,000 integers (may contain duplicates), for example:
813972066
908187460
365175040
120428932
908187460
504108776
The first approach that came to my mind was using HashMap:
static void printMostFrequent100Numbers() throws FileNotFoundException {
// Group unique numbers, key=number, value=frequency
Map<String, Integer> unsorted = new HashMap<>();
try (Scanner scanner = new Scanner(new File("numbers.txt"))) {
while (scanner.hasNextLine()) {
String number = scanner.nextLine();
unsorted.put(number, unsorted.getOrDefault(number, 0) + 1);
}
}
// Sort by frequency in descending order
List<Map.Entry<String, Integer>> sorted = new LinkedList<>(unsorted.entrySet());
sorted.sort((o1, o2) -> o2.getValue().compareTo(o1.getValue()));
// Print first 100 numbers
int count = 0;
for (Map.Entry<String, Integer> entry : sorted) {
System.out.println(entry.getKey());
if (++count == 100) {
return;
}
}
}
But it probably would throw an OutOfMemory exception for the data set of 4,000,000,000 numbers. Moreover, since 4,000,000,000 exceeds the maximum length of a Java array, let's say numbers are in a text file and they are not sorted. I assume multithreading or Map Reduce would be more appropriate for big data set?
How can the top 100 values be calculated when the data does not fit into the available memory?
If the data is sorted, you can collect the top 100 in O(n) where n is the data's size. Because the data is sorted, the distinct values are contiguous. Counting them while traversing the data once gives you the global frequency, which is not available to you when the data is not sorted.
See the sample code below on how this can be done. There is also an implementation (in Kotlin) of the entire approach on GitHub
Note: Sorting is not required. What is required is that distinct values are contiguous and so there is no need for ordering to be defined - we get this from sorting but perhaps there is a way of doing this more efficiently.
You can sort the data file using (external) merge sort in roughly O(n log n) by splitting the input data file into smaller files that fit into your memory, sorting and writing them out into sorted files then merging them.
About this code sample:
Sorted data is represented by a long[]. Because the logic reads values one by one, it's an OK approximation of reading the data from a sorted file.
The OP didn't specify how multiple values with equal frequency should be treated; consequently, the code doesn't do anything beyond ensuring that the result is top N values in no particular order and not implying that there aren't other values with the same frequency.
import java.util.*;
import java.util.Map.Entry;
class TopN {
private final int maxSize;
private Map<Long, Long> countMap;
public TopN(int maxSize) {
this.maxSize = maxSize;
this.countMap = new HashMap(maxSize);
}
private void addOrReplace(long value, long count) {
if (countMap.size() < maxSize) {
countMap.put(value, count);
} else {
Optional<Entry<Long, Long>> opt = countMap.entrySet().stream().min(Entry.comparingByValue());
Entry<Long, Long> minEntry = opt.get();
if (minEntry.getValue() < count) {
countMap.remove(minEntry.getKey());
countMap.put(value, count);
}
}
}
public Set<Long> get() {
return countMap.keySet();
}
public void process(long[] data) {
long value = data[0];
long count = 0;
for (long current : data) {
if (current == value) {
++count;
} else {
addOrReplace(value, count);
value = current;
count = 1;
}
}
addOrReplace(value, count);
}
public static void main(String[] args) {
long[] data = {0, 2, 3, 3, 4, 5, 5, 5, 5, 6, 6, 6, 7};
TopN topMap = new TopN(2);
topMap.process(data);
System.out.println(topMap.get()); // [5, 6]
}
}
Integers are signed 32 bits, so if only positive integers happen, we look at 2^31 max different entries. An array of 2^31 bytes should stay under max array size.
But that can't hold frequencies higher than 255, you would say? Yes, you're right.
So we add an hashmap for all entries that exceed the max value possible in your array (255 - if it's signed just start counting at -128). There are at most 16 million entries in this hash map (4 billion divided by 255), which should be possible.
We have two data structures:
a large array, indexed by the number read (0..2^31) of bytes.
a hashmap of (number read, frequency)
Algorithm:
while reading next number 'x'
{
if (hashmap.contains(x))
{
hashmap[x]++;
}
else
{
bigarray[x]++;
if (bigarray[x] > 250)
{
hashmap[x] = bigarray[x];
}
}
}
// when done:
// Look up top-100 in hashmap
// if not 100 yet, add more from bigarray, skipping those already taken from the hashmap
I'm not fluent in Java, so can't give a better code example.
Note that this algorithm is single-pass, works on unsorted input, and doesn't use external pre-processing steps.
All it does is assuming a maximum to the number read. It should work if the input are non-negative Integers, which have a maximum of 2^31. The sample input satisfies that constraint.
The algorithm above should satisfy most interviewers that ask this question. Whether you can code in Java should be established by a different question. This question is about designing data structures and efficient algorithms.
In pseudocode:
Perform an external sort
Do a pass to collect the top 100 frequencies (not which values have them)
Do another pass to collect the values that have those frequencies
Assumption: There are clear winners - no ties (outside the top 100).
Time complexity: O(n log n) (approx) due to sort.
Space complexity: Available memory, again due to sort.
Steps 2 and 3 are both O(n) time and O(1) space.
If there are no ties (outside the top 100), steps 2 and 3 can be combined into one pass, which wouldn’t improve the time complexity, but would improve the run time slightly.
If there are ties that would make the quantity of winners large, you couldn’t discover that and take special action (e.g., throw error or discard all ties) without two passes. You could however find the smallest 100 values from the ties with one pass.
But it probably would throw an OutOfMemory exception for the data set of 4000000000 numbers. Moreover, since 4000000000 exceeds max length of Java array, let's say numbers are in a text file and they are not sorted.
That depends on the value distribution. If you have 4E9 numbers, but the numbers are integers 1-1000, then you will end up with a map of 1000 entries. If the numbers are doubles or the value space is unrestricted, then you may have an issue.
As in the previous answer - there's a bug
unsorted.put(number, unsorted.getOrDefault(number, 0) + 1);
I personally would use "AtomicLong" for value, it allows to increase the value without updating the HashMap entries.
I assume multithreading or Map Reduce would be more appropriate for big data set?
What would be the most efficient solution for this problem?
This is a typical map-reduce exercise example, so in theory you could use multi-threaded or M-R approach. Maybe it's the goal of your exercise and you suppose to implement the multithreaded map-reduce tasks regardless if it's the most efficient way or not.
In reality you should calculate if it is worth the effort. If you're reading the input serially (as it's in your code using the Scanner), then definitely not. If you can split the input files and read multiple parts in parallel, considering the I/O throughput, it may be the case.
Or maybe if the value space is too large to fit into memory and you will need to downscale the dataset, you may consider different approach.
One option is a type of binary search. Consider a binary tree where each split corresponds to a bit in a 32-bit integer. So conceptually we have a binary tree of depth 32. At each node, we can compute the count of numbers in the set that start with the bit sequence for that node. This count is an O(n) operation, so the total cost of finding our most common sequence is going to be O(n * f(n)) where the function depends on how many nodes we need to enumerate.
Let's start by considering a depth-first search. This provides a reasonable upper bound to the stack size during enumeration. A brute force search of all nodes is obviously terrible (in that case, you can ignore the tree concept entirely and just enumerate over all the integers), but we have two things that can prevent us from needing to search all nodes:
If we ever reach a branch where there are 0 numbers in the set starting with that bit sequence, we can prune that branch and stop enumerating.
Once we hit a terminal node, we know how many occurrences of that specific number there are. We add this to our 'top 100' list, removing the lowest if necessary. Once this list fills up, we can start pruning any branches whose total count is lower than the lowest of the 'top 100' counts.
I'm not sure what the average and worst-case performance for this would be. It would tend to perform better for sets with fewer distinct numbers and probably performs worst for sets that approach uniformly distributed, since that implies more nodes will need to be searched.
A few observations:
There are at most N terminal nodes with non-zero counts, but since N > 2^32 in this specific case, that doesn't matter.
The total number of nodes for M leaf nodes (M = 2^32) is 2M-1. This is still linear in M, so worst case running time is bounded above at O(N*M).
This will perform worse than just searching all integers for some cases, but only by a linear scalar factor. Whether this performs better on average depends on the the expected data. For uniformly random data sets, my intuitive guess is that you'd be able to prune enough branches once the top-100 list fills up that you would tend to require fewer than M counts, but that would need to evaluated empirically or proven.
As a practical matter, the fact that this algorithm just requires read-only access to the data set (it only ever performs a count of numbers starting with a certain bit pattern) means it is amenable to parallelization by storing the data across multiple arrays, counting the subsets in parallel, then adding the counts together. This could be a pretty substantial speedup in a practical implementation that's harder to do with an approach that requires sorting.
A concrete example of how this might execute, for a simpler set of 3-bit numbers and only finding the single most frequent. Let's say the set is '000, 001, 100, 001, 100, 010'.
Count all numbers that start with '0'. This count is 4.
Go deeper, count all numbers that start with '00'. This count is 3.
Count all numbers that are '000'. This count is 1. This is our new most frequent.
Count all numbers that are '001'. This count is 2. This is our new most frequent.
Take next deep branch and count all numbers that start with '01'. This count is 1, which is less than our most frequent, so we can stop enumerating this branch.
Count all numbers that start with '1'. This count is 1, which is less than our most frequent, so we can stop enumerating this branch.
We're out of branches, so we're done and '001' is the most frequent.
Since the data set is presumably too big for memory, I'd do a hexadecimal radix sort. So the data set would get split between 16 files in each pass with as many passes as needed to get to the largest integer.
The second part would be to combine the files into one large data set.
The third part would be to read the file number by number and count the occurrence of each number. Save the number and number of occurrences into a two-dimensional array (the list) which is sorted by size. If the next number from the file has more occurrences than the number in the list with the lowest occurrences then replace that number.
Linux tools
That's simply done in a shell script on Linux/Mac:
sort inputfile | uniq -c | sort -nr | head -n 100
If the data is already sorted, you just use
uniq -c inputfile | sort -nr | head -n 100
File system
Another idea is to use the number as the filename and increase the file size for each hit
while read number;
do
echo -n "." >> number
done <<< inputfile
File system constraints could cause trouble with that many files, so you can create a directory tree with the first digits and store the files there.
When finished, you traverse through the tree and remember the 100 highest seen values for file size.
Database
You can use the same approach with a database, so you don't need to actually store the GB of data there (works too), just the counters (needs less space).
Interview
An interesting question would be how you handle edge cases, so what should happen if the 100th, 101st, ... number have the same frequency. Are the integers only positive?
What kind of output do they need, just the numbers or also the frequencies? Just think it through like a real task at work and ask everything you need to know to solve it. It's more about how you think and analyze a problem.
I have noticed there is a bug in this line.
unsorted.put(number, unsorted.getOrDefault(number, 1) + 1);
You should make the default value as 0 as you are then adding 1 to it. If not when you only have 1 occurrence of a value, it is recorded as the frequency of 2.
unsorted.put(number, unsorted.getOrDefault(number, 0) + 1);
One downside that I see is the unnecessity of keeping all 4 billion frequencies when you are sorting.
You can use a PriorityQueue to hold only 100 values.
Map<String, Integer> unsorted = new HashMap<>();
PriorityQueue<Map.Entry<String, Integer>> highestFrequentValues = new PriorityQueue<>(100,
(o1, o2) -> o2.getValue().compareTo(o1.getValue()));
// O(n)
try (Scanner scanner = new Scanner(new File("numbers.txt"))) {
while (scanner.hasNextLine()) {
String number = scanner.nextLine();
unsorted.put(number, unsorted.getOrDefault(number, 0) + 1);
}
}
// O(n)
for (Map.Entry<String, Integer> stringIntegerEntry : unsorted.entrySet()) {
if (highestFrequentValues.size() < 100) {
highestFrequentValues.add(stringIntegerEntry);
} else {
Map.Entry<String, Integer> minFrequencyWithinHundredEntries = highestFrequentValues.poll();
if (minFrequencyWithinHundredEntries.getValue() < stringIntegerEntry.getValue()) {
highestFrequentValues.add(stringIntegerEntry);
}
}
}
// O(n)
for (Map.Entry<String, Integer> frequentValue : highestFrequentValues) {
System.out.println(frequentValue.getKey());
}
OK, I know that the question is about Java and algorithms and solving this problem otherwise is not the point, but I still think this solution must be posted for completeness.
Solution in sh:
sort FILE | uniq -c | sort -nr | head -n 100
Explanation: sort | uniq -c lists only unique entries and counts the number of their occurrences in the input; sort -nr sorts the output numerically in reverse order (the lines with more occurrences on the top); head -n 100 keeps 100 top lines only. A file with 4,000,000,000 numbers up to 999999999 (as per OP) will take about ~40GB, so fits well on a disk of a single machine, so it is technically possible to use this solution.
Pro: simple, has constant and limited memory usage. Cons: sub-optimal (because of sort), consumes lots of the temporary disk space for the operation, and overall there is no doubt that a solution specifically designed for this problem will have a much better performance. The question remains (in all seriousness): in a general case, will writing (and then debugging and executing) an optimized solution take more or less time than using a sub-optimal one (as above) but available immediately? I ran the solution on a sample file with 400,000,000 lines (10x smaller) and it took about 7 minutes on my computer.
P.S. On a side note, OP mentions that this question was asked during a programming interview. This is interesting because I think this a kind of a solution worth mentioning in this context before starting to code another program from scratch. When people say "experienced engineers are 10x faster...", I personally don't think that this is because experienced engineers code faster or produce optimized algorithms off the top of the head, but because they explore the alternatives that can save time. In the context of an interview it is an important skill to demonstrate among others.
I suppose that 4 trillion was chosen to be sure the problem is too large to fit in memory on current desktop machines. So rent a large VM from Amazon or Microsoft for the purpose? That's an answer most people don't think of yet but is valid for real-world solutions.
The way I'd approach it is start by binning. The range of numbers is presumably all 32-bit unsigned integers (or whatever they said). How large of an array does fit in RAM? divide the range into that many equal bins and pass through the data once. Look over the distribution: Is it fairly uniform, or spikey, or a curve of some kind? If the first/last range of bins are zeros then it gives you the true range of input values, and you can adjust the program to just bin over that range and repeat, to get better accuracy.
Then depending on the distribution, decide how to proceed. In general, only the top 100 bins can possibly contain the top 100 values, so you can reconfigure with those ranges and the largest bins you can handle within that excerpted range. If the distribution is too uniform, you might get many many bins with all the same count, so drop the smaller bins even though you have many more than 100 bins remaining -- you still cut it down some.
Worst case is that all the bins come out the same and you can't cut it down this way! Someone prepared some pathological data assuming this kind of approach. So re-arrange the way you do the binning. Rather than simply chopping into contiguous ranges of equal size, us a 1:1 mapping to shuffle them. However, for large bins, this might preserve the property of being fairly uniform, so you don't want a conventional "good" hashing function.
Another approach
If binning works, and rapidly cuts down the problem, it's easy. But the data could be such that it's actually very difficult. So what's a way that always works, regardless of the data? Well, I can assume that the result exists: some 100 values will have more occurrences.
Instead of bins, pick n specific values (however many you can fit in memory). Either choose random numbers, or use the first N distinct values from your input. Count those, and copy the others to another file. That is, the values you don't have room to count get copied to a (smaller the original) file.
Now you'll at least have a useful pivot value: the exact cardinality of the 100 distinct top values that you did count exactly. Well, the ones you picked might still end up being all the same count! So you only have 1 distinct cardinality worst case. You know that this is not a "top" value since there are far more an 100 of them.
Run again on your new (smaller) file, and discard counts that are smaller than the top 100 you already know. Repeat.
This reminds me of something that I might have read in Knuth's TAOCP, but scaled up for modern machine sizes.
I would just drop all the numbers in a database (SQLite would be my first choice) with a table like
CREATE TABLE tbl (
number INTEGER PRIMARY KEY,
counter INTEGER
)
Then for every number received, just do a
INSERT INTO tbl (number,counter) VALUES (:number,1) ON DUPLICATE KEY UPDATE counter=counter+1;
or with SQLite syntax
INSERT INTO tbl (number,counter) VALUES (:number,1) ON CONFLICT(number) DO UPDATE SET counter=counter+1;
Then when all the numbers are accounted for,
SELECT number, counter FROM tbl ORDER BY counter DESC LIMIT 100
... then I would end up with the 100 most common numbers, and how often they occurred. This scheme will only break when you run out of disk space... (or when you reach ~20000000000000 (20 trillion) unique digits at some ~281 terabytes of disk space... )
Divide your numbers into two buckets
Find top 100 in each bucket
Merge those top 100 lists.
To divide, do median of medians (which can be modified to make medians of the top/bottom as well).
Each bucket has a distinct range of numbers in it. The initial median split makes 2 buckets, each with half (about) as many elements as the entire list in it.
To find the top 100, first know if the bucket is narrow (similar minimum and maximum) O(1) or small (few numbers in it) (O(n) time O(n*bucket count) memory). If either is true, a simple counting pass (possibly doing more than 1 bucket at once) solves it (you will have to do it more than once probably, as you have memory limits).
If neither is true, recurse and divide that bucket into two.
There are going to be fiddly bits with how you recurse without wasting too much time.
But the idea is that each bucket exponentially gets narrower or smaller. Narrow buckets have a minimum and maximum that is close, and small buckets have few elements.
You merge buckets so that you have enough storage to count the elements in the bucket (either width based, or volume based). Then you do a pass that counts that bucket and finds the top 100, and repeat. Each time you merge the top 100 from the scan into the previous top 100.
In-place, no sorting of the entire list needed, and devolves to simpler and more optimal strategies when the initial "bucket" is narrow or small.
I assume that the point of the challenge is to process this large amount of data without consuming too much memory, and avoid parsing the input too many times.
Here's an algorithm that would require two not too large arrays. Don't know about java, but I am confident that this can be made to run very fast in C:
Create a Count array of size 2^n to count the number of input numbers based on their n most significant bits. That will require a first scan over the input data but is really straightforward to do. I would first try with n=20 (about one million buckets).
Obviously, we won't process the data one bucket at a time, as that would require reading the input a million times, instead we choose our optimal batch size B and allocate a Batch array of size B. B could be like 40M, so that we aim at reading the input about 100 times. (It all depends on available memory).
Then we iterate over the count array to group the first range of buckets so that the sum is close to, but doesn't exceed B.
For each such range, we parse the input data, look for numbers in range and copy those numbers to the batch array. Since we already know the size of each bucket, we can immediately copy them grouped per bucket, so that we only have to sort them bucket by bucket (you can repurpose the count array to store the indices for where to write the next entry). Next we count the identical items in the sorted batch array and keep track of the top 100 so far.
Proceed the next range of buckets for which the sum of counts is under size B, etc...
Optimizations:
Once we start having a decent top 100, you can skip entire buckets whose size is below our 100th entry. For this we can use a special value (such as -1) in the count array, to indicate there is not index. Depending on the data, this can drastically reduce the number of passes required.
When counting identical items in the sorted Batch, we can make jumps of the size of your 100th entry (and then take a few steps backwards. I can share pseudo-code if needed)
Potential issues with this approach:
The input numbers could be concentrated in a small range, then you might get one or more single buckets that are larger than B. Possible solutions:
You could try another selection of n bits instead (eg. the n least significant bits). Note that that still won't help if the same numbers appears a billion times.
If the input is 32bit integers, then the range of possible values is limited, and there can only be a few thousand different numbers in each bucket. So if one bucket is really large, then we can process that bucket differently: Just keep a counter for each unique value in that range. We can repurpose the Batch array for that.

Any good techniques for Memory Usage optimization for maximum String input length

I'm designing a string algorithm and the problem lies in the size of the input. By definition Java's maximum string length is 2147483647, to avoid confusion ~2.15x10^9.
Manacher's algorithm requires, by definition an array of characters:
char[n*2 +3] where n is the length of the input (string of size n)
By definition the maximum integer is the mentioned above ~2.15x10^9, so an array of characters can be of max size
char [ ~2.15x10^9 ];
This computation of manachers algorithm in java, lowers the limit of the inputed string to n = (~2.15x10^9 - 3) / 2 . To be accurate thats exactly 1073741822. ~1.1x10^9.
A char array of maximum length has (n*2) + 32 bytes = ( ~2.1x10^9 * 2 ) + 32 bytes = ~4.2x10^9 bytes (4.2GBs)
Theres additional arrays of various sizes, sets and other collections. I believe this will make the program take an entire space of ~~30GBs. for maximal input of RAM memory for the computation of the algorithm which we identified to be at most ~1.1x10^9 characters long.
Could you advice me on some technique to keep things even between "Longest possible string input" and "Memory management" ? Thank you
According to this article, the Manacher algorithm finds the longest palindromic substring in linear time (with n being the length of the original string).
Here's an implementation in Java, which shows the algorithm is also quite good at memory consumption (you need two arrays, one of chars and another one of ints, both having twice the length of the original string, and you also need to store the original string).
The problem is that your original string is extremely long, so you are reaching language limits, memory limits, etc.
On the other hand, your alphabet consists only of 7 characters: your original string characters A C T G, plus a letter separator (such as #) and start and end of string characters (such as $ and #). This means that you only need 3 bits to store each possible character. So, if you are willing to work with bitwise operators and bit masks, you could store 21 characters in a long (this is because a long is represented with 64 bits). This approach would be more complex to code, but it would use much less memory.
Another possible solution would be to work with dynamic structures instead of strings and arrays. These structures would use considerable memory, but it wouldn't be contiguous memory, meaning that you wouldn't reach max array size limits and language limits for ints, etc. This approach uses a suffix tree, which, according to this article, is a linear time approach. In that article there's a solution in C++. Good luck!

How to optimize checking if one of the given arrays is contained in yet another array

I have an array of integers which is updated every set interval of time with a new value (let's call it data). When that happens I want to check if that array contains any other array of integers from specified set (let's call that collection).
I do it like this:
separate a sub-array from the end of data of length X (arrays in the collection have a set max length of X);
iterate trough the collection and check if any array in it is contained in the separated data chunk;
It works, though it doesn't seem optimal. But every other idea I have involves creating more collections (e.g. create a collection of all the arrays from the original collection that end with the same integer as data, repeat). And that seems even more complex (on the other hand, it looks like the only way to deal with arrays in collections without limited max length).
Are there any standard algorithms to deal with such a problem? If not, are there any worthwhile optimizations I can apply to my approach?
EDIT:
To be precise, I:
separate a sub-array from the end of data of length X (arrays in the collection have a set max length of X and if the don't it's just the length of the longest one in the collection);
iterate trough the collection and for every array in it:
separate sub-array from the previous sub-array with length matching current array in collection;
use Java's List.equals to compare the arrays;
EDIT 2:
Thanks for all the replays, surely they'll come handy some day. In this case I decided to drop the last steps and just compare the arrays in my own loop. That eliminates creating yet another sub-array and it's already O(N), so in this specific case will do.
Take a look at the KMP algorithm. It's been designed with String matching in mind, but it really comes down to matching subsequences of arrays to given sequences. Since that algorithm has linear complexity (O(n)), it can be said that it's pretty optimal. It's also a basic staple in standard algorithms.
dfens proposal is smart in that it incurs no significant extra complexity iff you keep the current product along with the main array, and can be checked in O(1), but it is also quite fragile and produces many false positives and negatives. Just imagine a target array [1, 1, ..., 1], which will always produce a positive test for all non-trivial main arrays. It also breaks down when one bucket contains a 0. That means that a successful check against his test is always a necessary condition for a hit (0s aside), but is never sufficient - aka with that method alone, you can never be sure of the validity of that result.
look at the rsync algorithm... if i understand it correctly you could go about:
you've got a immense array of data [length L]
at the end of that data, you've got N Bytes of data, and you want to know whether those N bytes ever appeared before.
precalculate:
for every offset in the array, calculate the checksum over the next N data elements.
Hold that checksum in a seperate array.
Using a rolling checksum like rsync does, you can do this step in O(N) time for all elements..
Whenever new data arrives:
Calculate the checksum over the last N elements. Using a rolling checksum, this could be O(1)
Check that checksum against all the precalculated checksums. If it matches, check equality of the subarrays (subslices , whatever...). If that matches too, you've got a match.
I think, in essence this is the same as dfen's approach with the product of all numbers.
I think you can keep product of array to for immediate rejections.
So if your array is [n_1,n_2,n_3,...] you can say that it is not subarray of [m_1,m_2,m_3,...] if product m_1*m_2*... = M is not divisible by productn_1*n_2*... = N.
Example
Let's say you have array
[6,7]
And comparing with:
[6,10,12]
[4,6,7]
Product of you array is 6*7 = 42
6 * 10 * 12 = 720 which is not divisible by 42 so you can reject first array immediately
[4, 6, 7] is divisble by 42 (but you cannot reject it - it can have other multipliers)
In each interval of time you can just multiply product by new number to avoid computing whole product everytime.
Note that you don't have to allocate anything if you simulate List's equals yourself. Just one more loop.
Similar to dfens' answer, I'd offer other criteria:
As the product is too big to be handled efficiently, compute the GCD instead. It produces much more false positives, but surely fits in long or int or whatever your original datatype is.
Compute the total number of trailing zeros, i.e., the "product" ignoring all factors but powers of 2. Also pretty weak, but fast. Use this criterion before the more time-consuming ones.
But... this is a special case of DThought's answer. Use rolling hash.

Searching for a particular String in an array

I would like to know what is the fastest way / algorithm of checking for the existence of a word in a String array. For an example, if I have a String array with 10,000 elements, I would like to know whether it has the word "Human". I can sort the array, no problem.
However, binary search (Arrays.binarySearch()) is not allowed. Other collection types like HashSet, HashMap and ArrayList are not allowed too.
Is there any proven algorithm for this? Or any other method? The way of searching should be really really fast.
the fastest way you can sort will result in O(nLogn) complexity
so if you are looking for particular word in unordered data just scan the array with single for cycle, that will cost you O(n)
For fastest performance you have to use hashing .
You can use rolling hash .
It ensures lesser number of collisions .
hash = [0]*base^(n-1) + [1]*base^(n-2) + ... + [n-1]
where base is a prime number , say 31 .
You need to take modulo also , so integer range is not exceeded , by a prime number .
Time complexity : O(number of characters) considering multiplication and modulo O(1) operation .
A very good explaination is given here : Fast implementation of Rolling hash
Build a trie out of the array. It can be built in linear time (assuming a constant size alphabet). Then you can query in linear time as well (time proportional to the query word length). Both preprocessing and query time are asymptotically optimal.

External shuffle: shuffling large amount of data out of memory

I am looking for a way to shuffle a large amount of data which does not fit into memory (approx. 40GB).
I have around 30 millions entries, of variable length, stored in one large file. I know the starting and ending positions of each entry in that file. I need to shuffle this data which does not fit in the RAM.
The only solution I thought of is to shuffle an array containing the numbers from 1 to N, where N is the number of entries, with the Fisher-Yates algorithm and then copy the entries in a new file, according to this order. Unfortunately, this solution involves a lot of seek operations, and thus, would be very slow.
Is there a better solution to shuffle large amount of data with uniform distribution?
First get the shuffle issue out of your face. Do this by inventing a hash algorithm for your entries that produces random-like results, then do a normal external sort on the hash.
Now you have transformed your shuffle into a sort your problems turn into finding an efficient external sort algorithm that fits your pocket and memory limits. That should now be as easy as google.
A simple approach is to pick a K such that 1/K of the data fits comfortably in memory. Perhaps K=4 for your data, assuming you've got 16GB RAM. I'll assume your random number function has the form rnd(n) which generates a uniform random number from 0 to n-1.
Then:
for i = 0 .. K-1
Initialize your random number generator to a known state.
Read through the input data, generating a random number rnd(K) for each item as you go.
Retain items in memory whenever rnd(K) == i.
After you've read the input file, shuffle the retained data in memory.
Write the shuffled retained items to the output file.
This is very easy to implement, will avoid a lot of seeking, and is clearly correct.
An alternative is to partition the input data into K files based on the random numbers, and then go through each, shuffling in memory and writing to disk. This reduces disk IO (each item is read twice and written twice, compared to the first approach where each item is read K times and written once), but you need to be careful to buffer the IO to avoid a lot of seeking, it uses more intermediate disk, and is somewhat more difficult to implement. If you've got only 40GB of data (so K is small), then the simple approach of multiple iterations through the input data is probably best.
If you use 20ms as the time for reading or writing 1MB of data (and assuming the in-memory shuffling cost is insignificant), the simple approach will take 40*1024*(K+1)*20ms, which is 1 minute 8 seconds (assuming K=4). The intermediate-file approach will take 40*1024*4*20ms, which is around 55 seconds, assuming you can minimize seeking. Note that SSD is approximately 20 times faster for reads and writes (even ignoring seeking), so you should expect to perform this task in well under 10s using an SSD. Numbers from Latency Numbers every Programmer should know
I suggest keeping your general approach, but inverting the map before doing the actual copy. That way, you read sequentially and do scattered writes rather than the other way round.
A read has to be done when requested before the program can continue. A write can be left in a buffer, increasing the probability of accumulating more than one write to the same disk block before actually doing the write.
Premise
From what I understand, using the Fisher-Yates algorithm and the data you have about the positions of the entries, you should be able to obtain (and compute) a list of:
struct Entry {
long long sourceStartIndex;
long long sourceEndIndex;
long long destinationStartIndex;
long long destinationEndIndex;
}
Problem
From this point onward, the naive solution is to seek each entry in the source file, read it, then seek to the new position of the entry in the destination file and write it.
The problem with this approach is that it uses way too many seeks.
Solution
A better way to do it, is to reduce the number of seeks, using two huge buffers, for each of the files.
I recommend a small buffer for the source file (say 64MB) and a big one for the destination file (as big as the user can afford - say 2GB).
Initially, the destination buffer will be mapped to the first 2GB of the destination file. At this point, read the whole source file, in chunks of 64MB, in the source buffer. As you read it, copy the proper entries to the destination buffer. When you reach the end of the file, the output buffer should contain all the proper data. Write it to the destination file.
Next, map the output buffer to the next 2GB of the destination file and repeat the procedure. Continue until you have wrote the whole output file.
Caution
Since the entries have arbitrary sizes, it's very likely that at the beginning and ending of the buffers you will have suffixes and prefixes of entries, so you need to make sure you copy the data properly!
Estimated time costs
The execution time depends, essentially, on the size of the source file, the available RAM for the application and the reading speed of the HDD. Assuming a 40GB file, a 2GB RAM and a 200MB/s HDD read speed, the program will need to read 800GB of data (40GB * (40GB / 2GB)). Assuming the HDD is not highly fragmented, the time spent on seeks will be negligible. This means the reads will take up one hour! But if, luckily, the user has 8GB of RAM available for your application, the time may decrease to only 15 to 20 minutes.
I hope this will be enough for you, as I don't see any other faster way.
Although you can use external sort on a random key, as proposed by OldCurmudgeon, the random key is not necessary. You can shuffle blocks of data in memory, and then join them with a "random merge," as suggested by aldel.
It's worth specifying what "random merge" means more clearly. Given two shuffled sequences of equal size, a random merge behaves exactly as in merge sort, with the exception that the next item to be added to the merged list is chosen using a boolean value from a shuffled sequence of zeros and ones, with exactly as many zeros as ones. (In merge sort, the choice would be made using a comparison.)
Proving it
My assertion that this works isn't enough. How do we know this process gives a shuffled sequence, such that every ordering is equally possible? It's possible to give a proof sketch with a diagram and a few calculations.
First, definitions. Suppose we have N unique items, where N is an even number, and M = N / 2. The N items are given to us in two M-item sequences labeled 0 and 1 that are guaranteed to be in a random order. The process of merging them produces a sequence of N items, such that each item comes from sequence 0 or sequence 1, and the same number of items come from each sequence. It will look something like this:
0: a b c d
1: w x y z
N: a w x b y c d z
Note that although the items in 0 and 1 appear to be in order, they are just labels here, and the order doesn't mean anything. It just serves to connect the order of 0 and 1 to the order of N.
Since we can tell from the labels which sequence each item came from, we can create a "source" sequence of zeros and ones. Call that c.
c: 0 1 1 0 1 0 0 1
By the definitions above, there will always be exactly as many zeros as ones in c.
Now observe that for any given ordering of labels in N, we can reproduce a c sequence directly, because the labels preserve information about the sequence they came from. And given N and c, we can reproduce the 0 and 1 sequences. So we know there's always one path back from a sequence N to one triple (0, 1, c). In other words, we have a reverse function r defined from the set of all orderings of N labels to triples (0, 1, c) -- r(N) = (0, 1, c).
We also have a forward function f from any triple r(n) that simply re-merges 0 and 1 according to the value of c. Together, these two functions show that there is a one-to-one correspondence between outputs of r(N) and orderings of N.
But what we really want to prove is that this one-to-one correspondence is exhaustive -- that is, we want to prove that there aren't extra orderings of N that don't correspond to any triple, and that there aren't extra triples that don't correspond to any ordering of N. If we can prove that, then we can choose orderings of N in a uniformly random way by choosing triples (0, 1, c) in a uniformly random way.
We can complete this last part of the proof by counting bins. Suppose every possible triple gets a bin. Then we drop every ordering of N in the bin for the triple that r(N) gives us. If there are exactly as many bins as orderings, then we have an exhaustive one-to-one correspondence.
From combinatorics, we know that number of orderings of N unique labels is N!. We also know that the number of orderings of 0 and 1 are both M!. And we know that the number of possible sequences c is N choose M, which is the same as N! / (M! * (N - M)!).
This means there are a total of
M! * M! * N! / (M! * (N - M)!)
triples. But N = 2 * M, so N - M = M, and the above reduces to
M! * M! * N! / (M! * M!)
That's just N!. QED.
Implementation
To pick triples in a uniformly random way, we must pick each element of the triple in a uniformly random way. For 0 and 1, we accomplish that using a straightforward Fisher-Yates shuffle in memory. The only remaining obstacle is generating a proper sequence of zeros and ones.
It's important -- important! -- to generate only sequences with equal numbers of zeros and ones. Otherwise, you haven't chosen from among Choose(N, M) sequences with uniform probability, and your shuffle may be biased. The really obvious way to do this is to shuffle a sequence containing an equal number of zeros and ones... but the whole premise of the question is that we can't fit that many zeros and ones in memory! So we need a way to generate random sequences of zeros and ones that are constrained such that there are exactly as many zeros as ones.
To do this in a way that is probabilistically coherent, we can simulate drawing balls labeled zero or one from an urn, without replacement. Suppose we start with fifty 0 balls and fifty 1 balls. If we keep count of the number of each kind of ball in the urn, we can maintain a running probability of choosing one or the other, so that the final result isn't biased. The (suspiciously Python-like) pseudocode would be something like this:
def generate_choices(N, M):
n0 = M
n1 = N - M
while n0 + n1 > 0:
if randrange(0, n0 + n1) < n0:
yield 0
n0 -= 1
else:
yield 1
n1 -= 1
This might not be perfect because of floating point errors, but it will be pretty close to perfect.
This last part of the algorithm is crucial. Going through the above proof exhaustively makes it clear that other ways of generating ones and zeros won't give us a proper shuffle.
Performing multiple merges in real data
There remain a few practical issues. The above argument assumes a perfectly balanced merge, and it also assumes you have only twice as much data as you have memory. Neither assumption is likely to hold.
The fist turns out not to be a big problem because the above argument doesn't actually require equally sized lists. It's just that if the list sizes are different, the calculations are a little more complex. If you go through the above replacing the M for list 1 with N - M throughout, the details all line up the same way. (The pseudocode is also written in a way that works for any M greater than zero and less than N. There will then be exactly M zeros and M - N ones.)
The second means that in practice, there might be many, many chunks to merge this way. The process inherits several properties of merge sort — in particular, it requires that for K chunks, you'll have to perform roughly K / 2 merges, and then K / 4 merges, and so on, until all the data has been merged. Each batch of merges will loop over the entire dataset, and there will be roughly log2(K) batches, for a run time of O(N * log(K)). An ordinary Fisher-Yates shuffle would be strictly linear in N, and so in theory would be faster for very large K. But until K gets very, very large, the penalty may be much smaller than the disk seeking penalties.
The benefit of this approach, then, comes from smart IO management. And with SSDs it might not even be worth it — the seek penalties might not be large enough to justify the overhead of multiple merges. Paul Hankin's answer has some practical tips for thinking through the practical issues raised.
Merging all data at once
An alternative to doing multiple binary merges would be to merge all the chunks at once -- which is theoretically possible, and might lead to an O(N) algorithm. The random number generation algorithm for values in c would need to generate labels from 0 to K - 1, such that the final outputs have exactly the right number of labels for each category. (In other words, if you're merging three chunks with 10, 12, and 13 items, then the final value of c would need to have 0 ten times, 1 twelve times, and 2 thirteen times.)
I think there is probably an O(N) time, O(1) space algorithm that will do that, and if I can find one or work one out, I'll post it here. The result would be a truly O(N) shuffle, much like the one Paul Hankin describes towards the end of his answer.
Logically partition your database entries (for e.g Alphabetically)
Create indexes based on your created partitions
build DAO to sensitize based on index

Categories

Resources