Integer compression in Java

I have a sequence of Integers in the following format:
Integer1 Integer2 Integer3 Integer4 Integer5 ....
Every four consecutive integers correspond to the values of a single record, so I cannot really reorder them.
What would be the best way to compress such file?
Updates:
1- The values are independent of each other. Each group of 4 consecutive integers represents a record, for example:
CustomerId PurchaseId Products MoneySpent
Each holds an integer value.
2- Ideally I would like to have it compressed as an object and on disk.
Thanks

The simplest and most compatible approach is to GZIP the file as you write it by wrapping your stream with GZIPOutputStream and reading it wrapped with GZIPInputStream.
InputStream in = new BufferedInputStream(new GZIPInputStream(new FileInputStream(filename)));
OutputStream out = new BufferedOutputStream(new GZIPOutputStream(new FileOutputStream(filename)));
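A minimal sketch of writing and reading the 4-int records through those streams (the class name, file name and record layout are illustrative, not from the answer):
import java.io.*;
import java.util.zip.*;

public class GzipRecords {

    public static void write(String filename, int[][] records) throws IOException {
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(new GZIPOutputStream(new FileOutputStream(filename))))) {
            for (int[] record : records) {
                // 4 ints per record, e.g. CustomerId, PurchaseId, Products, MoneySpent
                for (int value : record) {
                    out.writeInt(value);
                }
            }
        }
    }

    public static int[] readRecord(DataInputStream in) throws IOException {
        int[] record = new int[4];
        for (int i = 0; i < 4; i++) {
            record[i] = in.readInt();
        }
        return record;
    }
}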

Using GZip on the raw file is not optimal in the given form. Your CustomerId, PurchaseId, Products and MoneySpent values differ from each other, but all CustomerIds have something in common, as do the PurchaseIds, Products and MoneySpent values. So it is best to store those values not row-wise but column-wise.
Since you usually have a sort order within the table you are about to store, one column can be expressed as delta values. For example, if you sort your values by CustomerId you can express the sequence 10, 23, 44, 53 as 10, +13, +21, +9. These numbers are smaller and more likely to repeat than the original numbers.
Integer values can be expressed with variable bit-length information: first you store the number of bits of the value and then the actual value. This way you save a lot of leading zeros.
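As a byte-aligned simplification of the delta-plus-variable-length idea, here is a hedged sketch using 7-bit varints (class and method names are illustrative); the bit-level scheme described above would compress further:
import java.io.IOException;
import java.io.OutputStream;

// Hedged sketch: delta-encode a sorted int column, then write each delta as a
// variable-length integer (7 bits per byte, high bit = "more bytes follow").
class DeltaVarint {
    static void writeDeltaColumn(int[] sortedColumn, OutputStream out) throws IOException {
        int previous = 0;
        for (int value : sortedColumn) {
            int delta = value - previous;   // small and non-negative if the column is sorted
            previous = value;
            while ((delta & ~0x7F) != 0) {  // more than 7 significant bits remain
                out.write((delta & 0x7F) | 0x80);
                delta >>>= 7;
            }
            out.write(delta);               // final byte, high bit clear
        }
    }
}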
For money spent you can also exploit the repetition of typical cent values like 99, 25, 50, 49 and so on. It is more likely that a product has a price of 49.99 than 51.23. So splitting the money integer into two values (dollars and cents) lets you use Huffman encoding, treating the common values as symbols and the rest as run-length bits.
To express the bit length itself you can also use different encoding schemes. One would be, yet again, a Huffman code over 64 symbols (the 64 possible length values) with a trained coding table. This way you end up with far fewer bits than writing full integers or even longs.
The remaining output can still be put through GZip. How well this works depends on the way you express the bit lengths, since it is easier to compress leading zeros than varying bit-length information, and every extra compression stage has a cost.
Another coding scheme for bit lengths is using the min max approach.
For example, for the above sequence 10, 23, 44, 53 we store 10, +43 (53), +13, +23. The idea is that between 10 and 53 there is a range of only 43 values, so the next value needs at most 6 bits (2^6 = 64). This way there is no need for separate bit-length information. You just store the sequence in the order first minimum, then maximum, then next minimum, next maximum and so on.
A more efficient scheme uses minimum, maximum, middle, middle-left, middle-right, middle-left-left, middle-left-right, middle-right-left, middle-right-right, and so on. This gives you the best chance of knowing a small bit length for each value, so the integers end up very small without any additional bit-length information.
Such schemes often leave GZip less than 10% of further reduction, so you can omit GZip entirely.
[Summary]
So: GZip is simple. If you need to squeeze out more, go column-wise instead of row/entry-wise and use special knowledge of each column. If a column is sorted, represent it as deltas. Express the bit lengths with Huffman codes (one per column); splitting product prices into dollars and cents often gives very good compression as well. Store sorted columns as deltas and use the tree-wise storage, which gives very good knowledge about the bit length to expect next.

Related

How can I get the most frequent 100 numbers out of 4,000,000,000 numbers?

Yesterday in a coding interview I was asked how to get the most frequent 100 numbers out of 4,000,000,000 integers (may contain duplicates), for example:
813972066
908187460
365175040
120428932
908187460
504108776
The first approach that came to my mind was using HashMap:
static void printMostFrequent100Numbers() throws FileNotFoundException {
    // Group unique numbers, key=number, value=frequency
    Map<String, Integer> unsorted = new HashMap<>();
    try (Scanner scanner = new Scanner(new File("numbers.txt"))) {
        while (scanner.hasNextLine()) {
            String number = scanner.nextLine();
            unsorted.put(number, unsorted.getOrDefault(number, 0) + 1);
        }
    }

    // Sort by frequency in descending order
    List<Map.Entry<String, Integer>> sorted = new LinkedList<>(unsorted.entrySet());
    sorted.sort((o1, o2) -> o2.getValue().compareTo(o1.getValue()));

    // Print first 100 numbers
    int count = 0;
    for (Map.Entry<String, Integer> entry : sorted) {
        System.out.println(entry.getKey());
        if (++count == 100) {
            return;
        }
    }
}
But it probably would throw an OutOfMemoryError for a data set of 4,000,000,000 numbers. Moreover, since 4,000,000,000 exceeds the maximum length of a Java array, let's say the numbers are in a text file and are not sorted. I assume multithreading or MapReduce would be more appropriate for such a big data set?
How can the top 100 values be calculated when the data does not fit into the available memory?
If the data is sorted, you can collect the top 100 in O(n) where n is the data's size. Because the data is sorted, the distinct values are contiguous. Counting them while traversing the data once gives you the global frequency, which is not available to you when the data is not sorted.
See the sample code below for how this can be done. There is also an implementation (in Kotlin) of the entire approach on GitHub.
Note: Sorting is not required. What is required is that distinct values are contiguous and so there is no need for ordering to be defined - we get this from sorting but perhaps there is a way of doing this more efficiently.
You can sort the data file using an (external) merge sort in roughly O(n log n) by splitting the input data file into smaller files that fit into your memory, sorting them and writing them out as sorted files, then merging them.
About this code sample:
Sorted data is represented by a long[]. Because the logic reads values one by one, it's an OK approximation of reading the data from a sorted file.
The OP didn't specify how multiple values with equal frequency should be treated; consequently, the code doesn't do anything beyond ensuring that the result is top N values in no particular order and not implying that there aren't other values with the same frequency.
import java.util.*;
import java.util.Map.Entry;

class TopN {
    private final int maxSize;
    private Map<Long, Long> countMap;

    public TopN(int maxSize) {
        this.maxSize = maxSize;
        this.countMap = new HashMap<>(maxSize);
    }

    private void addOrReplace(long value, long count) {
        if (countMap.size() < maxSize) {
            countMap.put(value, count);
        } else {
            Optional<Entry<Long, Long>> opt = countMap.entrySet().stream().min(Entry.comparingByValue());
            Entry<Long, Long> minEntry = opt.get();
            if (minEntry.getValue() < count) {
                countMap.remove(minEntry.getKey());
                countMap.put(value, count);
            }
        }
    }

    public Set<Long> get() {
        return countMap.keySet();
    }

    public void process(long[] data) {
        long value = data[0];
        long count = 0;
        for (long current : data) {
            if (current == value) {
                ++count;
            } else {
                addOrReplace(value, count);
                value = current;
                count = 1;
            }
        }
        addOrReplace(value, count);
    }

    public static void main(String[] args) {
        long[] data = {0, 2, 3, 3, 4, 5, 5, 5, 5, 6, 6, 6, 7};
        TopN topMap = new TopN(2);
        topMap.process(data);
        System.out.println(topMap.get()); // [5, 6]
    }
}
Integers are signed 32-bit values, so if only non-negative integers occur, we are looking at 2^31 different entries at most. An array of 2^31 bytes is right at Java's maximum array length (about 2^31 - 1), so in practice you would size it just under that limit.
But that can't hold frequencies higher than 255, you would say? Yes, you're right.
So we add a hashmap for all entries that exceed the maximum value possible in your array (255 - if it's signed, just start counting at -128). There are at most 16 million entries in this hash map (4 billion divided by 255), which should be possible.
We have two data structures:
a large array of bytes, indexed by the number read (0..2^31).
a hashmap of (number read, frequency)
Algorithm:
while reading next number 'x'
{
    if (hashmap.contains(x))
    {
        hashmap[x]++;
    }
    else
    {
        bigarray[x]++;
        if (bigarray[x] > 250)
        {
            hashmap[x] = bigarray[x];
        }
    }
}
// when done:
// Look up top-100 in hashmap
// if not 100 yet, add more from bigarray, skipping those already taken from the hashmap
I'm not fluent in Java, so can't give a better code example.
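For reference, here is a hedged Java sketch of the idea (the array size, the promotion threshold, and the roughly 2 GB heap requirement are assumptions for illustration; Java bytes are signed, so the threshold sits below 127):
import java.util.HashMap;
import java.util.Map;

// Hedged sketch of the big-byte-array + overflow-map idea; assumes non-negative int input.
class ByteArrayCounter {
    private static final int ARRAY_LIMIT = Integer.MAX_VALUE - 8; // conservative JVM array-length limit
    private final byte[] small = new byte[ARRAY_LIMIT];           // needs roughly a 2 GB heap
    private final Map<Integer, Long> overflow = new HashMap<>();

    void add(int x) {
        Long big = overflow.get(x);
        if (big != null) {
            overflow.put(x, big + 1);             // already promoted: keep counting in the map
        } else if (x >= ARRAY_LIMIT) {
            overflow.put(x, 1L);                  // not indexable by the array: count in the map
        } else if (++small[x] > 120) {
            overflow.put(x, (long) small[x]);     // promote before the signed byte wraps at 127
        }
    }
    // After all input is consumed, the top 100 live in `overflow` (plus, if the map holds fewer
    // than 100 entries, the largest counts remaining in `small` for values not already promoted).
}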
Note that this algorithm is single-pass, works on unsorted input, and doesn't use external pre-processing steps.
All it does is assume a maximum for the numbers read. It should work if the input consists of non-negative integers, which have a maximum of 2^31. The sample input satisfies that constraint.
The algorithm above should satisfy most interviewers that ask this question. Whether you can code in Java should be established by a different question. This question is about designing data structures and efficient algorithms.
In pseudocode:
Perform an external sort
Do a pass to collect the top 100 frequencies (not which values have them)
Do another pass to collect the values that have those frequencies
Assumption: There are clear winners - no ties (outside the top 100).
Time complexity: O(n log n) (approx) due to sort.
Space complexity: Available memory, again due to sort.
Steps 2 and 3 are both O(n) time and O(1) space.
If there are no ties (outside the top 100), steps 2 and 3 can be combined into one pass, which wouldn’t improve the time complexity, but would improve the run time slightly.
If there are ties that would make the quantity of winners large, you couldn’t discover that and take special action (e.g., throw error or discard all ties) without two passes. You could however find the smallest 100 values from the ties with one pass.
But it probably would throw an OutOfMemory exception for the data set of 4000000000 numbers. Moreover, since 4000000000 exceeds max length of Java array, let's say numbers are in a text file and they are not sorted.
That depends on the value distribution. If you have 4E9 numbers, but the numbers are integers 1-1000, then you will end up with a map of 1000 entries. If the numbers are doubles or the value space is unrestricted, then you may have an issue.
As mentioned in another answer, there was a bug in the counting line; it should be:
unsorted.put(number, unsorted.getOrDefault(number, 0) + 1);
I would personally use an AtomicLong for the value; it lets you increment the count without re-putting entries into the HashMap.
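A hedged sketch of that suggestion (the helper class and method names are illustrative):
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;

// The counter object is created once per key and then incremented in place,
// instead of re-putting a boxed Integer on every occurrence.
class AtomicCounting {
    static Map<String, AtomicLong> count(Iterable<String> numbers) {
        Map<String, AtomicLong> counts = new HashMap<>();
        for (String number : numbers) {
            counts.computeIfAbsent(number, k -> new AtomicLong()).incrementAndGet();
        }
        return counts;
    }
}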
I assume multithreading or Map Reduce would be more appropriate for big data set?
What would be the most efficient solution for this problem?
This is a typical map-reduce exercise, so in theory you could use a multi-threaded or map-reduce approach. Maybe that is the goal of your exercise and you are supposed to implement multithreaded map-reduce tasks regardless of whether it is the most efficient way.
In reality you should consider whether it is worth the effort. If you're reading the input serially (as your code does with the Scanner), then definitely not. If you can split the input files and read multiple parts in parallel, then, depending on the I/O throughput, it may be worth it.
And if the value space is too large to fit into memory and you need to downscale the dataset, you may want to consider a different approach.
One option is a type of binary search. Consider a binary tree where each split corresponds to a bit in a 32-bit integer. So conceptually we have a binary tree of depth 32. At each node, we can compute the count of numbers in the set that start with the bit sequence for that node. This count is an O(n) operation, so the total cost of finding our most common sequence is going to be O(n * f(n)) where the function depends on how many nodes we need to enumerate.
Let's start by considering a depth-first search. This provides a reasonable upper bound to the stack size during enumeration. A brute force search of all nodes is obviously terrible (in that case, you can ignore the tree concept entirely and just enumerate over all the integers), but we have two things that can prevent us from needing to search all nodes:
If we ever reach a branch where there are 0 numbers in the set starting with that bit sequence, we can prune that branch and stop enumerating.
Once we hit a terminal node, we know how many occurrences of that specific number there are. We add this to our 'top 100' list, removing the lowest if necessary. Once this list fills up, we can start pruning any branches whose total count is lower than the lowest of the 'top 100' counts.
I'm not sure what the average and worst-case performance for this would be. It would tend to perform better for sets with fewer distinct numbers and probably performs worst for sets that approach uniformly distributed, since that implies more nodes will need to be searched.
A few observations:
There are at most N terminal nodes with non-zero counts, but since N is on the order of 2^32 in this specific case, that bound doesn't help.
The total number of nodes for M leaf nodes (M = 2^32) is 2M-1. This is still linear in M, so worst case running time is bounded above at O(N*M).
This will perform worse than just searching all integers for some cases, but only by a linear scalar factor. Whether this performs better on average depends on the expected data. For uniformly random data sets, my intuitive guess is that you'd be able to prune enough branches once the top-100 list fills up that you would tend to require fewer than M counts, but that would need to be evaluated empirically or proven.
As a practical matter, the fact that this algorithm just requires read-only access to the data set (it only ever performs a count of numbers starting with a certain bit pattern) means it is amenable to parallelization by storing the data across multiple arrays, counting the subsets in parallel, then adding the counts together. This could be a pretty substantial speedup in a practical implementation that's harder to do with an approach that requires sorting.
A concrete example of how this might execute, for a simpler set of 3-bit numbers and only finding the single most frequent. Let's say the set is '000, 001, 100, 001, 100, 010'.
Count all numbers that start with '0'. This count is 4.
Go deeper, count all numbers that start with '00'. This count is 3.
Count all numbers that are '000'. This count is 1. This is our new most frequent.
Count all numbers that are '001'. This count is 2. This is our new most frequent.
Take next deep branch and count all numbers that start with '01'. This count is 1, which is less than our most frequent, so we can stop enumerating this branch.
Count all numbers that start with '1'. This count is 1, which is less than our most frequent, so we can stop enumerating this branch.
We're out of branches, so we're done and '001' is the most frequent.
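For illustration, a hedged Java sketch of the pruned depth-first search (kept in-memory and single-threaded for brevity; in practice count() would stream the data set or be parallelized as described):
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Assumes the data set consists of non-negative ints.
class PrefixSearch {
    private final int[] data;
    private final int keep;
    // min-heap of {value, count} pairs, ordered by count
    private final PriorityQueue<long[]> top = new PriorityQueue<>(Comparator.comparingLong((long[] a) -> a[1]));

    PrefixSearch(int[] data, int keep) {
        this.data = data;
        this.keep = keep;
    }

    // Count the values whose top `bits` bits equal `prefix` (one O(n) pass).
    private long count(int prefix, int bits) {
        long c = 0;
        for (int v : data) {
            if (bits == 0 || (v >>> (32 - bits)) == prefix) {
                c++;
            }
        }
        return c;
    }

    private void dfs(int prefix, int bits) {
        long c = count(prefix, bits);
        if (c == 0) {
            return;                                   // prune empty branches
        }
        if (top.size() == keep && c <= top.peek()[1]) {
            return;                                   // branch can't beat the current top list
        }
        if (bits == 32) {                             // terminal node: an exact value
            top.offer(new long[]{prefix, c});
            if (top.size() > keep) {
                top.poll();
            }
            return;
        }
        dfs(prefix << 1, bits + 1);
        dfs((prefix << 1) | 1, bits + 1);
    }

    List<long[]> run() {
        dfs(0, 0);
        return new ArrayList<>(top);                  // {value, count} pairs, unordered
    }
}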
Since the data set is presumably too big for memory, I'd do a hexadecimal radix sort. So the data set would get split between 16 files in each pass with as many passes as needed to get to the largest integer.
The second part would be to combine the files into one large data set.
The third part would be to read the file number by number and count the occurrence of each number. Save the number and number of occurrences into a two-dimensional array (the list) which is sorted by size. If the next number from the file has more occurrences than the number in the list with the lowest occurrences then replace that number.
Linux tools
That's simply done in a shell script on Linux/Mac:
sort inputfile | uniq -c | sort -nr | head -n 100
If the data is already sorted, you just use
uniq -c inputfile | sort -nr | head -n 100
File system
Another idea is to use the number as the filename and increase the file size for each hit
while read number;
do
    echo -n "." >> "$number"
done < inputfile
File system constraints could cause trouble with that many files, so you can create a directory tree with the first digits and store the files there.
When finished, you traverse through the tree and remember the 100 highest seen values for file size.
Database
You can use the same approach with a database, so you don't need to actually store the GB of data there (works too), just the counters (needs less space).
Interview
An interesting question would be how you handle edge cases, so what should happen if the 100th, 101st, ... number have the same frequency. Are the integers only positive?
What kind of output do they need, just the numbers or also the frequencies? Just think it through like a real task at work and ask everything you need to know to solve it. It's more about how you think and analyze a problem.
I have noticed there is a bug in this line.
unsorted.put(number, unsorted.getOrDefault(number, 1) + 1);
You should make the default value as 0 as you are then adding 1 to it. If not when you only have 1 occurrence of a value, it is recorded as the frequency of 2.
unsorted.put(number, unsorted.getOrDefault(number, 0) + 1);
One downside that I see is that it unnecessarily keeps and sorts all of the frequency entries when you only need the top 100.
You can use a PriorityQueue to hold only 100 values.
Map<String, Integer> unsorted = new HashMap<>();
// A min-heap ordered by frequency keeps only the 100 most frequent entries seen so far
PriorityQueue<Map.Entry<String, Integer>> highestFrequentValues =
        new PriorityQueue<>(100, Map.Entry.comparingByValue());

// O(n)
try (Scanner scanner = new Scanner(new File("numbers.txt"))) {
    while (scanner.hasNextLine()) {
        String number = scanner.nextLine();
        unsorted.put(number, unsorted.getOrDefault(number, 0) + 1);
    }
}

// O(n)
for (Map.Entry<String, Integer> stringIntegerEntry : unsorted.entrySet()) {
    if (highestFrequentValues.size() < 100) {
        highestFrequentValues.add(stringIntegerEntry);
    } else {
        Map.Entry<String, Integer> minFrequencyWithinHundredEntries = highestFrequentValues.peek();
        if (minFrequencyWithinHundredEntries.getValue() < stringIntegerEntry.getValue()) {
            highestFrequentValues.poll(); // drop the current minimum before adding the larger entry
            highestFrequentValues.add(stringIntegerEntry);
        }
    }
}

// O(100)
for (Map.Entry<String, Integer> frequentValue : highestFrequentValues) {
    System.out.println(frequentValue.getKey());
}
OK, I know that the question is about Java and algorithms and solving this problem otherwise is not the point, but I still think this solution must be posted for completeness.
Solution in sh:
sort FILE | uniq -c | sort -nr | head -n 100
Explanation: sort | uniq -c lists only unique entries and counts the number of their occurrences in the input; sort -nr sorts the output numerically in reverse order (the lines with more occurrences on the top); head -n 100 keeps 100 top lines only. A file with 4,000,000,000 numbers up to 999999999 (as per OP) will take about ~40GB, so fits well on a disk of a single machine, so it is technically possible to use this solution.
Pro: simple, has constant and limited memory usage. Cons: sub-optimal (because of sort), consumes lots of the temporary disk space for the operation, and overall there is no doubt that a solution specifically designed for this problem will have a much better performance. The question remains (in all seriousness): in a general case, will writing (and then debugging and executing) an optimized solution take more or less time than using a sub-optimal one (as above) but available immediately? I ran the solution on a sample file with 400,000,000 lines (10x smaller) and it took about 7 minutes on my computer.
P.S. On a side note, OP mentions that this question was asked during a programming interview. This is interesting because I think this a kind of a solution worth mentioning in this context before starting to code another program from scratch. When people say "experienced engineers are 10x faster...", I personally don't think that this is because experienced engineers code faster or produce optimized algorithms off the top of the head, but because they explore the alternatives that can save time. In the context of an interview it is an important skill to demonstrate among others.
I suppose that 4 billion was chosen to make sure the problem is too large to fit in memory on current desktop machines. So rent a large VM from Amazon or Microsoft for the purpose? That's an answer most people don't think of, yet it is valid for real-world solutions.
The way I'd approach it is to start by binning. The range of numbers is presumably all 32-bit unsigned integers (or whatever was stated). How large an array fits in RAM? Divide the range into that many equal bins and pass through the data once. Look at the distribution: is it fairly uniform, spiky, or a curve of some kind? If the first/last ranges of bins are all zeros, that gives you the true range of input values, and you can adjust the program to bin over just that range and repeat, to get better resolution.
Then, depending on the distribution, decide how to proceed. In general, the top 100 values will be found in the bins with the largest counts (a bin containing a top value must count at least that value's frequency), so you can reconfigure with those ranges and the largest bins you can handle within that excerpted range. If the distribution is too uniform, you might get many, many bins with the same count, so drop the smaller bins even though far more than 100 bins remain -- you still cut the problem down some.
Worst case is that all the bins come out the same and you can't cut it down this way! Someone prepared some pathological data assuming this kind of approach. So rearrange the way you do the binning. Rather than simply chopping into contiguous ranges of equal size, use a 1:1 mapping to shuffle them. However, for large bins, this might preserve the property of being fairly uniform, so you don't want a conventional "good" hashing function.
Another approach
If binning works, and rapidly cuts down the problem, it's easy. But the data could be such that it's actually very difficult. So what's a way that always works, regardless of the data? Well, I can assume that the result exists: some 100 values will have more occurrences.
Instead of bins, pick N specific values (however many you can fit in memory). Either choose random numbers, or use the first N distinct values from your input. Count those, and copy the others to another file. That is, the values you don't have room to count get copied to a (smaller than the original) file.
Now you'll at least have a useful pivot value: the exact counts of the top 100 among the values you did count. Well, the ones you picked might still end up all having the same count! So you may have only 1 distinct count in the worst case. You know that this is not a "top" value, since there are far more than 100 of them.
Run again on your new (smaller) file, and discard counts that are smaller than the top 100 you already know. Repeat.
This reminds me of something that I might have read in Knuth's TAOCP, but scaled up for modern machine sizes.
I would just drop all the numbers in a database (SQLite would be my first choice) with a table like
CREATE TABLE tbl (
    number INTEGER PRIMARY KEY,
    counter INTEGER
)
Then for every number received, just do a
INSERT INTO tbl (number,counter) VALUES (:number,1) ON DUPLICATE KEY UPDATE counter=counter+1;
or with SQLite syntax
INSERT INTO tbl (number,counter) VALUES (:number,1) ON CONFLICT(number) DO UPDATE SET counter=counter+1;
Then when all the numbers are accounted for,
SELECT number, counter FROM tbl ORDER BY counter DESC LIMIT 100
... then I would end up with the 100 most common numbers, and how often they occurred. This scheme will only break when you run out of disk space... (or when you reach ~20,000,000,000,000 (20 trillion) unique numbers, at some ~281 terabytes of disk space...)
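A hedged JDBC sketch of this approach, assuming the sqlite-jdbc driver is on the classpath (the connection URL, file name and in-line sample data are illustrative stand-ins for the real input):
import java.sql.*;

class SqliteCounter {
    public static void main(String[] args) throws SQLException {
        try (Connection conn = DriverManager.getConnection("jdbc:sqlite:counts.db")) {
            try (Statement st = conn.createStatement()) {
                st.execute("CREATE TABLE IF NOT EXISTS tbl (number INTEGER PRIMARY KEY, counter INTEGER)");
            }
            conn.setAutoCommit(false);                       // commit in batches, or this will crawl
            String upsert = "INSERT INTO tbl (number, counter) VALUES (?, 1) "
                          + "ON CONFLICT(number) DO UPDATE SET counter = counter + 1";
            try (PreparedStatement ps = conn.prepareStatement(upsert)) {
                long[] sample = {813972066L, 908187460L, 908187460L};   // stand-in for streaming the file
                for (long n : sample) {
                    ps.setLong(1, n);
                    ps.addBatch();
                }
                ps.executeBatch();
            }
            conn.commit();
            try (Statement st = conn.createStatement();
                 ResultSet rs = st.executeQuery(
                         "SELECT number, counter FROM tbl ORDER BY counter DESC LIMIT 100")) {
                while (rs.next()) {
                    System.out.println(rs.getLong("number") + " x " + rs.getLong("counter"));
                }
            }
        }
    }
}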
Divide your numbers into two buckets
Find top 100 in each bucket
Merge those top 100 lists.
To divide, do median of medians (which can be modified to make medians of the top/bottom as well).
Each bucket has a distinct range of numbers in it. The initial median split makes 2 buckets, each with half (about) as many elements as the entire list in it.
To find the top 100, first determine whether the bucket is narrow (its minimum and maximum are close; an O(1) check) or small (few numbers in it; O(n) time and O(n * bucket count) memory). If either is true, a simple counting pass (possibly covering more than one bucket at once) solves it (you will probably have to do it more than once, as you have memory limits).
If neither is true, recurse and divide that bucket into two.
There are going to be fiddly bits with how you recurse without wasting too much time.
But the idea is that each bucket exponentially gets narrower or smaller. Narrow buckets have a minimum and maximum that is close, and small buckets have few elements.
You merge buckets so that you have enough storage to count the elements in the bucket (either width based, or volume based). Then you do a pass that counts that bucket and finds the top 100, and repeat. Each time you merge the top 100 from the scan into the previous top 100.
In-place, no sorting of the entire list needed, and devolves to simpler and more optimal strategies when the initial "bucket" is narrow or small.
I assume that the point of the challenge is to process this large amount of data without consuming too much memory, and avoid parsing the input too many times.
Here's an algorithm that would require two not too large arrays. Don't know about java, but I am confident that this can be made to run very fast in C:
Create a Count array of size 2^n to count the number of input numbers based on their n most significant bits. That will require a first scan over the input data but is really straightforward to do. I would first try with n=20 (about one million buckets).
Obviously, we won't process the data one bucket at a time, as that would require reading the input a million times, instead we choose our optimal batch size B and allocate a Batch array of size B. B could be like 40M, so that we aim at reading the input about 100 times. (It all depends on available memory).
Then we iterate over the count array to group the first range of buckets so that the sum is close to, but doesn't exceed B.
For each such range, we parse the input data, look for numbers in range and copy those numbers to the batch array. Since we already know the size of each bucket, we can immediately copy them grouped per bucket, so that we only have to sort them bucket by bucket (you can repurpose the count array to store the indices for where to write the next entry). Next we count the identical items in the sorted batch array and keep track of the top 100 so far.
Proceed the next range of buckets for which the sum of counts is under size B, etc...
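A minimal sketch of the first counting pass described above, with n = 20 (the iterable stands in for streaming the input file):
class BucketCounts {
    // Counts per bucket, keyed by the top 20 bits of each number's 32-bit pattern.
    static int[] countBuckets(Iterable<Integer> input) {
        int[] counts = new int[1 << 20];          // about one million buckets, 4 MB
        for (int x : input) {
            counts[x >>> 12] += 1;                // unsigned shift: the top 20 bits index the bucket
        }
        return counts;
    }
}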
Optimizations:
Once we start having a decent top 100, we can skip entire buckets whose size is below our 100th entry. For this we can use a special value (such as -1) in the count array to indicate there is no index. Depending on the data, this can drastically reduce the number of passes required.
When counting identical items in the sorted Batch, we can make jumps of the size of our 100th entry (and then take a few steps backwards; I can share pseudo-code if needed).
Potential issues with this approach:
The input numbers could be concentrated in a small range, then you might get one or more single buckets that are larger than B. Possible solutions:
You could try another selection of n bits instead (e.g. the n least significant bits). Note that this still won't help if the same number appears a billion times.
If the input is 32-bit integers, then the range of possible values is limited, and there can only be a few thousand different numbers in each bucket. So if one bucket is really large, we can process that bucket differently: just keep a counter for each unique value in that range. We can repurpose the Batch array for that.

External shuffle: shuffling large amount of data out of memory

I am looking for a way to shuffle a large amount of data which does not fit into memory (approx. 40GB).
I have around 30 million entries, of variable length, stored in one large file. I know the starting and ending positions of each entry in that file. I need to shuffle this data, which does not fit in the RAM.
The only solution I thought of is to shuffle an array containing the numbers from 1 to N, where N is the number of entries, with the Fisher-Yates algorithm and then copy the entries in a new file, according to this order. Unfortunately, this solution involves a lot of seek operations, and thus, would be very slow.
Is there a better solution to shuffle large amount of data with uniform distribution?
First get the shuffle issue out of the way. Do this by inventing a hash algorithm for your entries that produces random-like results, then do a normal external sort on the hash.
Now that you have transformed your shuffle into a sort, the problem turns into finding an efficient external sort algorithm that fits your budget and memory limits. That should be as easy as a Google search.
A simple approach is to pick a K such that 1/K of the data fits comfortably in memory. Perhaps K=4 for your data, assuming you've got 16GB RAM. I'll assume your random number function has the form rnd(n) which generates a uniform random number from 0 to n-1.
Then:
for i = 0 .. K-1
    Initialize your random number generator to a known state.
    Read through the input data, generating a random number rnd(K) for each item as you go.
    Retain items in memory whenever rnd(K) == i.
    After you've read the input file, shuffle the retained data in memory.
    Write the shuffled retained items to the output file.
This is very easy to implement, will avoid a lot of seeking, and is clearly correct.
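A hedged Java sketch of the K-pass approach, treating each line of a text file as one entry (the file handling and line-per-entry assumption are illustrative):
import java.io.*;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;
import java.util.*;

public class KPassShuffle {
    public static void shuffle(Path in, Path out, int k, long seed) throws IOException {
        try (BufferedWriter writer = Files.newBufferedWriter(out, StandardCharsets.UTF_8)) {
            for (int pass = 0; pass < k; pass++) {
                Random rnd = new Random(seed);      // same seed every pass => same assignment per entry
                List<String> retained = new ArrayList<>();
                try (BufferedReader reader = Files.newBufferedReader(in, StandardCharsets.UTF_8)) {
                    String line;
                    while ((line = reader.readLine()) != null) {
                        if (rnd.nextInt(k) == pass) {   // keep roughly 1/k of the entries this pass
                            retained.add(line);
                        }
                    }
                }
                Collections.shuffle(retained);          // in-memory shuffle of this pass's slice
                for (String s : retained) {
                    writer.write(s);
                    writer.newLine();
                }
            }
        }
    }
}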
An alternative is to partition the input data into K files based on the random numbers, and then go through each, shuffling in memory and writing to disk. This reduces disk IO (each item is read twice and written twice, compared to the first approach where each item is read K times and written once), but you need to be careful to buffer the IO to avoid a lot of seeking, it uses more intermediate disk, and is somewhat more difficult to implement. If you've got only 40GB of data (so K is small), then the simple approach of multiple iterations through the input data is probably best.
If you use 20 ms as the time for reading or writing 1 MB of data (and assuming the in-memory shuffling cost is insignificant), the simple approach will take 40*1024*(K+1)*20 ms, which is about 68 minutes (assuming K=4). The intermediate-file approach will take 40*1024*4*20 ms, which is around 55 minutes, assuming you can minimize seeking. Note that an SSD is approximately 20 times faster for reads and writes (even ignoring seeking), so you should expect to perform this task in well under 10 minutes using an SSD. Numbers from Latency Numbers Every Programmer Should Know.
I suggest keeping your general approach, but inverting the map before doing the actual copy. That way, you read sequentially and do scattered writes rather than the other way round.
A read has to be done when requested before the program can continue. A write can be left in a buffer, increasing the probability of accumulating more than one write to the same disk block before actually doing the write.
Premise
From what I understand, using the Fisher-Yates algorithm and the data you have about the positions of the entries, you should be able to obtain (and compute) a list of:
struct Entry {
    long long sourceStartIndex;
    long long sourceEndIndex;
    long long destinationStartIndex;
    long long destinationEndIndex;
};
Problem
From this point onward, the naive solution is to seek each entry in the source file, read it, then seek to the new position of the entry in the destination file and write it.
The problem with this approach is that it uses way too many seeks.
Solution
A better way to do it is to reduce the number of seeks by using two huge buffers, one for each of the files.
I recommend a small buffer for the source file (say 64MB) and a big one for the destination file (as big as the user can afford - say 2GB).
Initially, the destination buffer will be mapped to the first 2GB of the destination file. At this point, read the whole source file, in chunks of 64MB, into the source buffer. As you read it, copy the proper entries into the destination buffer. When you reach the end of the file, the output buffer should contain all the proper data. Write it to the destination file.
Next, map the output buffer to the next 2GB of the destination file and repeat the procedure. Continue until you have written the whole output file.
Caution
Since the entries have arbitrary sizes, it's very likely that at the beginning and ending of the buffers you will have suffixes and prefixes of entries, so you need to make sure you copy the data properly!
Estimated time costs
The execution time depends, essentially, on the size of the source file, the available RAM for the application and the reading speed of the HDD. Assuming a 40GB file, 2GB of RAM and a 200MB/s HDD read speed, the program will need to read 800GB of data (40GB * (40GB / 2GB)). Assuming the HDD is not highly fragmented, the time spent on seeks will be negligible. This means the reads alone will take over an hour! But if, luckily, the user has 8GB of RAM available for your application, the time may decrease to only 15 to 20 minutes.
I hope this will be enough for you, as I don't see any other faster way.
Although you can use external sort on a random key, as proposed by OldCurmudgeon, the random key is not necessary. You can shuffle blocks of data in memory, and then join them with a "random merge," as suggested by aldel.
It's worth specifying what "random merge" means more clearly. Given two shuffled sequences of equal size, a random merge behaves exactly as in merge sort, with the exception that the next item to be added to the merged list is chosen using a boolean value from a shuffled sequence of zeros and ones, with exactly as many zeros as ones. (In merge sort, the choice would be made using a comparison.)
Proving it
My assertion that this works isn't enough. How do we know this process gives a shuffled sequence, such that every ordering is equally possible? It's possible to give a proof sketch with a diagram and a few calculations.
First, definitions. Suppose we have N unique items, where N is an even number, and M = N / 2. The N items are given to us in two M-item sequences labeled 0 and 1 that are guaranteed to be in a random order. The process of merging them produces a sequence of N items, such that each item comes from sequence 0 or sequence 1, and the same number of items come from each sequence. It will look something like this:
0: a b c d
1: w x y z
N: a w x b y c d z
Note that although the items in 0 and 1 appear to be in order, they are just labels here, and the order doesn't mean anything. It just serves to connect the order of 0 and 1 to the order of N.
Since we can tell from the labels which sequence each item came from, we can create a "source" sequence of zeros and ones. Call that c.
c: 0 1 1 0 1 0 0 1
By the definitions above, there will always be exactly as many zeros as ones in c.
Now observe that for any given ordering of labels in N, we can reproduce a c sequence directly, because the labels preserve information about the sequence they came from. And given N and c, we can reproduce the 0 and 1 sequences. So we know there's always one path back from a sequence N to one triple (0, 1, c). In other words, we have a reverse function r defined from the set of all orderings of N labels to triples (0, 1, c) -- r(N) = (0, 1, c).
We also have a forward function f from any triple (0, 1, c) that simply re-merges 0 and 1 according to the values in c. Together, these two functions show that there is a one-to-one correspondence between outputs of r(N) and orderings of N.
But what we really want to prove is that this one-to-one correspondence is exhaustive -- that is, we want to prove that there aren't extra orderings of N that don't correspond to any triple, and that there aren't extra triples that don't correspond to any ordering of N. If we can prove that, then we can choose orderings of N in a uniformly random way by choosing triples (0, 1, c) in a uniformly random way.
We can complete this last part of the proof by counting bins. Suppose every possible triple gets a bin. Then we drop every ordering of N in the bin for the triple that r(N) gives us. If there are exactly as many bins as orderings, then we have an exhaustive one-to-one correspondence.
From combinatorics, we know that number of orderings of N unique labels is N!. We also know that the number of orderings of 0 and 1 are both M!. And we know that the number of possible sequences c is N choose M, which is the same as N! / (M! * (N - M)!).
This means there are a total of
M! * M! * N! / (M! * (N - M)!)
triples. But N = 2 * M, so N - M = M, and the above reduces to
M! * M! * N! / (M! * M!)
That's just N!. QED.
Implementation
To pick triples in a uniformly random way, we must pick each element of the triple in a uniformly random way. For 0 and 1, we accomplish that using a straightforward Fisher-Yates shuffle in memory. The only remaining obstacle is generating a proper sequence of zeros and ones.
It's important -- important! -- to generate only sequences with equal numbers of zeros and ones. Otherwise, you haven't chosen from among Choose(N, M) sequences with uniform probability, and your shuffle may be biased. The really obvious way to do this is to shuffle a sequence containing an equal number of zeros and ones... but the whole premise of the question is that we can't fit that many zeros and ones in memory! So we need a way to generate random sequences of zeros and ones that are constrained such that there are exactly as many zeros as ones.
To do this in a way that is probabilistically coherent, we can simulate drawing balls labeled zero or one from an urn, without replacement. Suppose we start with fifty 0 balls and fifty 1 balls. If we keep count of the number of each kind of ball in the urn, we can maintain a running probability of choosing one or the other, so that the final result isn't biased. The (suspiciously Python-like) pseudocode would be something like this:
def generate_choices(N, M):
    n0 = M
    n1 = N - M
    while n0 + n1 > 0:
        if randrange(0, n0 + n1) < n0:
            yield 0
            n0 -= 1
        else:
            yield 1
            n1 -= 1
This might not be perfect because of floating point errors, but it will be pretty close to perfect.
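A hedged Java port of the generator, for completeness (materializing the whole array is just for brevity; in practice you would consume the choices one at a time):
import java.util.Random;

// Draw zeros and ones "from an urn" without replacement, so that exactly
// m zeros and n - m ones come out, each ordering equally likely.
class BalancedBits {
    static boolean[] generateChoices(int n, int m, Random rnd) {
        boolean[] out = new boolean[n];
        int zerosLeft = m;
        int onesLeft = n - m;
        for (int i = 0; i < n; i++) {
            if (rnd.nextInt(zerosLeft + onesLeft) < zerosLeft) {
                out[i] = false;            // emit a 0 (take from sequence 0)
                zerosLeft--;
            } else {
                out[i] = true;             // emit a 1 (take from sequence 1)
                onesLeft--;
            }
        }
        return out;
    }
}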
This last part of the algorithm is crucial. Going through the above proof exhaustively makes it clear that other ways of generating ones and zeros won't give us a proper shuffle.
Performing multiple merges in real data
There remain a few practical issues. The above argument assumes a perfectly balanced merge, and it also assumes you have only twice as much data as you have memory. Neither assumption is likely to hold.
The first turns out not to be a big problem because the above argument doesn't actually require equally sized lists. It's just that if the list sizes are different, the calculations are a little more complex. If you go through the above replacing the M for list 1 with N - M throughout, the details all line up the same way. (The pseudocode is also written in a way that works for any M greater than zero and less than N. There will then be exactly M zeros and N - M ones.)
The second means that in practice, there might be many, many chunks to merge this way. The process inherits several properties of merge sort — in particular, it requires that for K chunks, you'll have to perform roughly K / 2 merges, and then K / 4 merges, and so on, until all the data has been merged. Each batch of merges will loop over the entire dataset, and there will be roughly log2(K) batches, for a run time of O(N * log(K)). An ordinary Fisher-Yates shuffle would be strictly linear in N, and so in theory would be faster for very large K. But until K gets very, very large, the penalty may be much smaller than the disk seeking penalties.
The benefit of this approach, then, comes from smart IO management. And with SSDs it might not even be worth it — the seek penalties might not be large enough to justify the overhead of multiple merges. Paul Hankin's answer has some practical tips for thinking through the practical issues raised.
Merging all data at once
An alternative to doing multiple binary merges would be to merge all the chunks at once -- which is theoretically possible, and might lead to an O(N) algorithm. The random number generation algorithm for values in c would need to generate labels from 0 to K - 1, such that the final outputs have exactly the right number of labels for each category. (In other words, if you're merging three chunks with 10, 12, and 13 items, then the final value of c would need to have 0 ten times, 1 twelve times, and 2 thirteen times.)
I think there is probably an O(N) time, O(1) space algorithm that will do that, and if I can find one or work one out, I'll post it here. The result would be a truly O(N) shuffle, much like the one Paul Hankin describes towards the end of his answer.
Logically partition your database entries (e.g. alphabetically).
Create indexes based on the partitions you created.
Build a DAO to fetch entries based on the index.

Packing many bounded integers into a large single integer

I have to store millions of entries in a database. Each entry is identified by a set of unique integer identifiers. For example a value may be identified by a set of 10 integer identifiers, each of which are less than 100 million.
In order to reduce the size of the database, I thought of the following encoding using a single 32 bit integer value.
Identifier 1: 0 - 100,000,000
Identifier 2: 100,000,001 - 200,000,000
.
.
.
Identifier 10: 900,000,001 - 1,000,000,000
I am using Java. I can write a simple method to encode/decode. The user code does not have to know that I am encoding/decoding during fetch/store.
What I want to know is: what is the most efficient (fastest) and recommended way to implement such encoding/decoding. A simple implementation will perform a large number of multiplications/subtractions.
Is it possible to use shifts (or bitwise operations) and choose different partition size (the size of each segment still has to be close to 100 million)?
I am open to any suggestions, ideas, or even a totally different scheme. I want to exploit the fact that the integer identifiers are bounded to drastically reduce the storage size without noticeably compromising performance.
Edit: I just wanted to add that I went through some of the answers posted on this forum. A common solution was to split the bits for each identifier. If I use 2 bits for each identifier for a total of 10 identifiers, then my range of identifiers gets severely limited.
It sounds like you want to pack multiple integer values of 0...100m into a single 32bit Integer? Unless you are omitting important information that would allow to store these 0...100m values more efficiently, there is simply no way to do it.
ceil(log2(100m)) = 27bit, which means you have only 5 "spare bits".
You can make the segment size 27 bits, which gives you 32 segments of 128M each instead of about 42 segments of 100M.
int value = ...; // the packed value
int high = value >>> 27;
int low = value & ((1 << 27) - 1);
It is worth noting that this calculation is likely to be trivial compared to the cost of using a database.
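For illustration, a hedged sketch of the same high/low split, packing two sub-2^27 identifiers into one long (the class name and field width are illustrative):
// Assumes both fields are non-negative and below 2^27.
class Pack27 {
    static final int WIDTH = 27;
    static final long MASK = (1L << WIDTH) - 1;

    static long pack(int high, int low) {
        return ((long) high << WIDTH) | (low & MASK);
    }

    static int high(long packed) {
        return (int) (packed >>> WIDTH);
    }

    static int low(long packed) {
        return (int) (packed & MASK);
    }
}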
It's unclear what you actually want to do, but it sounds like you want an integer value with each bit representing a particular attribute, queried by applying a bitmask.
A 32-bit integer can store 32 different attributes, a 64-bit one 64, etc. To have more, you'll need multiple integer columns.
If that's not it, I don't know what you mean by "encode".

How to manage and manipulate extremely large binary values

I need to read in a couple of extremely large strings which are comprised of binary digits. These strings can be extremely large (up to 100,000 digits), and I need to store them, be able to manipulate them (flip bits) and add them together. My first thought was to split the string into 8-character chunks, convert them to bytes and store them in an array. This would allow me to flip bits with relative ease given the index of the bit to flip, but with this approach I'm unsure how I would go about adding the two values together in their entirety.
Can anyone see a way of storing these values in a memory-efficient manner which would still allow me to perform calculations on them?
EDIT:
"add together" (concatenate? arithmetic addition?) - arithmetic addition
My problem is that in the hardest case I have two 100,000 bit numbers (stored in an array of 12,500 bytes). Storing and manually flipping bits isn't an issue, but I need the sum of both numbers and then to be able to find out what the xth bit of this is.
"Strings of binary digits" definitely sound like byte arrays to me. To "add" two such byte arrays together, you'd just allocate a new byte array which is big enough to hold everything, and copy the contents using System.arraycopy.
However that assumes each "string" is a multiple of 8 bits. If you want to "add" a string of 15 bits to another string of 15 bits, you'll need to do bit-shifting. Is that likely to be a problem for you? Depending on what operations you need, you may even want to just keep an object which knows about two byte arrays and can find an arbitrary bit in the logically joined "string".
Either way, byte[] is going to be the way forward - or possibly BitSet.
What about
// Addition (note: this adds byte by byte and does not propagate carries between bytes)
byte[] resArr = new byte[byteArr1.length];
for (int i = 0; i < byteArr1.length; i++)
{
    resArr[i] = (byte) (byteArr1[i] + byteArr2[i]);
}
?
Is it something like this you are trying to do?
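For arithmetic addition with carries (what the question's edit asks for), a hedged sketch over little-endian byte arrays might look like this (the layout, with byte 0 holding the least significant 8 bits, is an assumption, not from the answer above):
class BigBinaryAdd {
    static byte[] add(byte[] a, byte[] b) {
        int n = Math.max(a.length, b.length);
        byte[] result = new byte[n + 1];                   // one extra byte for a final carry
        int carry = 0;
        for (int i = 0; i < n; i++) {
            int x = i < a.length ? a[i] & 0xFF : 0;        // & 0xFF: treat bytes as unsigned
            int y = i < b.length ? b[i] & 0xFF : 0;
            int sum = x + y + carry;
            result[i] = (byte) sum;
            carry = sum >>> 8;
        }
        result[n] = (byte) carry;
        return result;
    }

    // The xth bit of a value (x = 0 is the least significant bit).
    static boolean bit(byte[] value, int x) {
        return (value[x >>> 3] >> (x & 7) & 1) != 0;
    }
}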

How can I generate a unique int from a unique string?

I have an object with a String that holds a unique id.
(such as "ocx7gf" or "67hfs8")
I need to supply it with an implementation of int hashCode() which will be unique, obviously.
How do I convert a string to a unique int in the easiest/fastest way?
10x.
Edit - OK. I already know that String.hashCode() is possible, but it is not recommended anywhere. Actually, if no other method is recommended, should I use it or not, given that I have my object in a collection and I need the hash code? Should I concatenate it with another string to make it more successful?
No, you don't need to have an implementation that returns a unique value, "obviously", as obviously the majority of implementations would be broken.
What you want to do, is to have a good spread across bits, especially for common values (if any values are more common than others). Barring special knowledge of your format, then just using the hashcode of the string itself would be best.
With special knowledge of the limits of your id format, it may be possible to customise and result in better performance, though false assumptions are more likely to make things worse than better.
Edit: On good spread of bits.
As stated here and in other answers, being completely unique is impossible and hash collisions are possible. Hash-using methods know this and can deal with it, but it does impact upon performance, so we want collisions to be rare.
Further, hashes are generally re-hashed, so our 32-bit number may end up being reduced to e.g. one in the range 0 to 22, and we want as good a distribution within that as possible too.
We also want to balance this with not taking so long to compute our hash, that it becomes a bottleneck in itself. An imperfect balancing act.
A classic example of a bad hash method is one for a co-ordinate pair of X, Y ints that does:
return X ^ Y;
While this does a perfectly good job of returning 2^32 possible values out of the 4^32 possible inputs, in real world use it's quite common to have sets of coordinates where X and Y are equal ({0, 0}, {1, 1}, {2, 2} and so on) which all hash to zero, or matching pairs ({2,3} and {3, 2}) which will hash to the same number. We are likely better served by:
return ((X << 16) | (X >>> 16)) ^ Y;
Now, there are just as many possible values for which this is dreadful as for the former, but it tends to serve better in real-world cases.
Of course, there is a different job if you are writing a general-purpose class (no idea what possible inputs there are) or have a better idea of the purpose at hand. For example, if I was using Date objects but knew that they would all be dates only (time part always midnight) and only within a few years of each other, then I might prefer a custom hash code that used only the day, month and lower-digits of the years, over the standard one. The writer of Date though can't work on such knowledge and has to try to cater for everyone.
Hence, if I for instance knew that a given string will always consist of 6 case-insensitive characters in the range [a-z] or [0-9] (which yours seem to, though it isn't clear from your question), then I might use an algorithm that assigns a value from 0 to 35 (the 36 possible values for each character) to each character, and then walks through the string, each time multiplying the current value by 36 and adding the value of the next char.
Assuming a good spread in the ids, this would be the way to go, especially if I made the order such that the lower-significant digits in my hash matched the most frequently changing char in the id (if such a call could be made), hence surviving re-hashing to a smaller range well.
However, lacking such knowledge of the format for sure, I can't make that call with certainty, and I could well be making things worse (slower algorithm for little or even negative gain in hash quality).
One advantage you have is that since it's an ID in itself, then presumably no other non-equal object has the same ID, and hence no other properties need be examined. This doesn't always hold.
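A hedged sketch of the base-36 walk described above, assuming ids are case-insensitive strings over [a-z0-9] (an assumption about the format, not confirmed by the question):
class Base36Hash {
    static int hash(String id) {
        int hash = 0;
        for (int i = 0; i < id.length(); i++) {
            char c = Character.toLowerCase(id.charAt(i));
            int digit = (c >= '0' && c <= '9') ? c - '0' : c - 'a' + 10; // value 0..35
            hash = hash * 36 + digit;   // may overflow for longer ids, which is acceptable for a hash
        }
        return hash;
    }
}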
You can't get a unique integer from a String of unlimited length. There are 4 billionish (2^32) unique integers, but an almost infinite number of unique strings.
String.hashCode() will not give you unique integers, but it will do its best to give you differing results based on the input string.
EDIT
Your edited question says that String.hashCode() is not recommended. This is not true, it is recommended, unless you have some special reason not to use it. If you do have a special reason, please provide details.
Looks like you've got a base-36 number there (a-z + 0-9). Why not convert it to an int using Integer.parseInt(s, 36)? Obviously, if there are too many unique IDs, it won't fit into an int, but in that case you're out of luck with unique integers and will need to get by using String.hashCode(), which does its best to be close to unique.
Unless your strings are limited in some way or your integers hold more bits than the strings you're trying to convert, you cannot guarantee the uniqueness.
Let's say you have a 32 bit integer and a 64-character character set for your strings. That means six bits per character. That will allow you to store five characters into an integer. More than that and it won't fit.
Represent each character by a five-bit binary code, e.g. a by 00001, b by 00010, etc., so 32 combinations are possible. For example, cat might be written as 00100 00001 01100; converting this binary to decimal gives 4140, so cat would be 4140. Similarly, you can get cat back from 4140 by converting it to binary first and mapping each five-bit group back to a character.
One way to do it is to assign each letter a value and each position in the string its own multiplier, i.e. a = 1, b = 2, and so on; then everything in the first position (read left to right) is multiplied by one prime number, the next by the next prime number, and so on, such that the final position is multiplied by a prime larger than the number of possible symbols in that position (26 + 1 for a space, or 52 + 1 with capitals, and so on for other supported characters). If the number is mapped back to the first positions (leftmost characters), any number you generate from a unique string, mapping back to 1 or 6 or whatever the first letter is, gives a unique value.
Dog might be 30,3(15),101(7) or 782, while God 33,3(15),101(4) or 482. More importantly than unique strings being generated they can be useful in generation if the original digit is kept, like 30(782) would be unique to some 12(782) for the purposes of differentiating like strings if you ever managed to go over the unique possibilities. Dog would always be Dog, but it would never be Cat or Mouse.
