How to compress floating point data? [closed] - java

I have read the research on SPDP: An Automatically Synthesized Lossless Compression Algorithm for Floating-Point Data https://userweb.cs.txstate.edu/~mb92/papers/dcc18.pdf
Now I would like to implement a program that compresses floating-point data.
I do not know where to start. I have a text file with a set of real numbers inside.
I know that I have to use a mixing technique.
Is it better to use C or Java?
My idea was to XOR the current value with the previous value, then count the frequencies of these residuals, and finally apply the Huffman algorithm.
Could this be the right approach?
Any ideas to suggest?
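To make the idea concrete, here is a minimal sketch of the XOR-delta step in Java (assuming a hypothetical data.txt with one float per line; the Huffman stage is not shown):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class XorDelta {
    public static void main(String[] args) throws IOException {
        int[] freq = new int[256];               // byte frequencies of the residuals, for Huffman later
        int prev = 0;
        for (String line : Files.readAllLines(Paths.get("data.txt"))) {
            int bits = Float.floatToRawIntBits(Float.parseFloat(line.trim()));
            int residual = bits ^ prev;          // XOR the current value with the previous one
            prev = bits;
            for (int s = 0; s < 32; s += 8) {    // count each byte of the residual
                freq[(residual >>> s) & 0xFF]++;
            }
        }
        // freq[] would now drive the construction of the Huffman tree (not shown).
    }
}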

According to the paper, their code was compiled with gcc/g++ 5.3.1 using the “-O3 -march=native” flags, so you can probably go with something like that. Also, this sounds like a short-running tool, which makes it better suited to C than to Java anyway.
As for writing the algorithm, you will probably want to use the one they determined is best. In that case you'll need to read slowly and carefully what I have copied below. If there's anything you don't understand then you'll have to research further.
Carefully read the descriptions of each of the sub-algorithms (algorithmic components) and write their forward and reverse implementations. You need the reverse implementations so that you can decompress your data later.
Once you have all the sub-algorithms complete and tested, you can combine them as described into the synthesized algorithm. And also write the reversal for the synthesized algorithm.
The algorithmic components are described farther below.
5.1. Synthesized Algorithm
SPDP, the best-compressing four-component algorithm for our datasets in CRUSHER’s
9,400,320-entry search space is LNVs2 | DIM8 LNVs1 LZa6. Whereas there has to be a reducer component at the end, none appear in the first three positions, i.e., CRUSHER generated a three-stage data model followed by a one-stage coder. This result shows that chaining whole compression algorithms, each of which would include a reducer, is not beneficial. Also, the Cut appears after the first component, so it is important to first treat the data at word granularity and then at byte granularity to maximize the compression ratio.
The LNVs2 component at the beginning that operates at 4-byte granularity is of particular interest. It subtracts the second-previous value from the current value in the sequence and emits the residual. This enables the algorithm to handle both single- and double-precision data well. In case of 8-byte doubles, it takes the upper half of the previous double and subtracts it from the upper half of the current double. Then it does the same for the lower halves. The result is, except for a suppressed carry, the same as computing the difference sequence on 8-byte values. In case of 4-byte single-precision data, this component also computes the difference sequence, albeit using the second-to-last rather than the last value. If the values are similar, which is where difference sequences help, then the second-previous value is also similar and should yield residuals that cluster around zero as well. This observation answers our first research question. We are able to learn from the synthesized algorithm, in this case how to handle mixed single/double-precision datasets.
The DIM8 component after the Cut separates the bytes making up the single or double values such that the most significant bytes are grouped together, followed by the second most significant bytes, etc. This is likely done because the most significant bytes, which hold the exponent and top mantissa bits in IEEE 754 floating-point values, correlate more with each other than with the remaining bytes in the same value. This assumption is supported by the LNVs1 component that follows, which computes the byte-granularity difference sequence and, therefore, exploits precisely this similarity between the bytes in the same position of consecutive values. The LZa6 component compresses the resulting difference sequence. It uses n = 6 to avoid bad matches that result in zero counts being emitted, which expand rather than compress the data. The chosen high value of n indicates that bad matches are frequent, as is expected with relatively random datasets (cf. Table 1).
2.1. Algorithmic Components
The DIMn component takes a parameter n that specifies the dimensionality and groups the values accordingly. For example, a dimension of three changes the linear sequence x1, y1, z1, x2, y2, z2, x3, y3, z3 into x1, x2, x3, y1, y2, y3, z1, z2, z3. We use n = 2, 4, 8, and 12.
The LNVkn component takes two parameters. It subtracts the last nth value from the current value and emits the residual. If k = ‘s’, arithmetic subtraction is used. If k = ‘x’, bitwise subtraction (xor) is used. In both cases, we tested n = 1, 2, 3, 4, 8, 12, 16, 32, and 64. None of the above components change the size of the data blocks. The next three components are the only ones that can reduce the length of a data block, i.e., compress it.
The LZln component implements a variant of the LZ77 algorithm (Ziv, J. and A. Lempel. “A Universal Algorithm for Sequential Data Compression.” IEEE Transactions on Information Theory, Vol. 23, No. 3, pp. 337-343, 1977). It incorporates tradeoffs that make it more efficient than other LZ77 versions on hard-to-compress data and operates as follows. It uses a 32768-entry hash table to identify the l most recent prior occurrences of the current value. Then it checks whether the n values immediately preceding those locations match the n values just before the current location. If they do not, only the current value is emitted and the component advances to the next value. If the n values match, the component counts how many values following the current value match the values after that location. The length of the matching substring is emitted and the component advances by that many values. We consider n = 3, 4, 5, 6, and 7 combined with l = ‘a’, ‘b’, and ‘c’, where ‘a’ = 1, ‘b’ = 2, and ‘c’ = 4, which yields fifteen LZln components.
The │ pseudo component, called the Cut and denoted by a vertical bar, is a singleton component that converts a sequence of words into a sequence of bytes. Every algorithm produced by CRUSHER contains a Cut, which is included because it may be more effective to perform none, some, or all of the compression at byte rather than word granularity.
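To make the components concrete, here is an illustrative Java sketch (mine, not the paper's code) of byte-granularity LNVs and DIM; note that neither changes the length of the data block:

static byte[] lnvSub(byte[] in, int n) {      // LNVs_n: subtract the n-th previous value
    byte[] out = new byte[in.length];
    for (int i = 0; i < in.length; i++) {
        byte prev = (i >= n) ? in[i - n] : 0;
        out[i] = (byte) (in[i] - prev);       // residuals cluster near zero for similar data
    }
    return out;
}

static byte[] dim(byte[] in, int n) {         // DIM_n: group every n-th byte together
    byte[] out = new byte[in.length];
    int rows = in.length / n;                 // assumes in.length is divisible by n
    for (int i = 0; i < rows * n; i++) {
        out[(i % n) * rows + i / n] = in[i];  // x1,y1,z1,x2,... -> x1,x2,...,y1,y2,...
    }
    return out;
}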
Remember that you'll also need to include the reversals of these algorithms if you want to decompress your data.
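For example, the matching inverses of the sketches above (again, only a sketch) are a running sum for LNVs and the inverse permutation for DIM:

static byte[] lnvSubInverse(byte[] in, int n) {
    byte[] out = new byte[in.length];
    for (int i = 0; i < in.length; i++) {
        byte prev = (i >= n) ? out[i - n] : 0;  // uses the already-reconstructed output
        out[i] = (byte) (in[i] + prev);
    }
    return out;
}

static byte[] dimInverse(byte[] in, int n) {
    byte[] out = new byte[in.length];
    int rows = in.length / n;
    for (int i = 0; i < rows * n; i++) {
        out[i] = in[(i % n) * rows + i / n];    // undo the grouping permutation
    }
    return out;
}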
I hope this clarification helped, and best of luck!

Burtscher has several papers on floating point compression. Before jumping into SPDP you might want to try this paper: https://userweb.cs.txstate.edu/~burtscher/papers/tr08.pdf. The paper has a code listing on page 7; you might just copy and paste it into a C file which you can experiment with before attempting harder algorithms.
Secondly, do not expect these FP compression algorithms to compress all floating point data. To get a good compression ratio, neighboring FP values are expected to be numerically close to each other or to exhibit some repeating pattern. Burtscher uses a method called Finite Context Modeling (FCM) and differential FCM: "I have seen this pattern before; let me predict the next value and then XOR the actual and predicted values to achieve compression..."
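To illustrate the flavor of FCM, here is a simplified sketch (not Burtscher's actual code): a hash of recently seen values indexes a prediction table, and a good prediction makes the XOR residual mostly zero bytes, which compress well:

final class FcmPredictor {
    private final long[] table = new long[1 << 16];   // prediction table, 65536 contexts
    private int ctx = 0;                              // hash of recently seen values

    long residual(double value) {
        long bits = Double.doubleToRawLongBits(value);
        long predicted = table[ctx];
        table[ctx] = bits;                            // learn what followed this context
        ctx = (int) (((ctx << 6) ^ (bits >>> 48)) & (table.length - 1));
        return predicted ^ bits;                      // mostly zeros when the prediction is good
    }
}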

Java algorithm for evenly distributing ranges of strings into buckets [closed]

Short version - I'm looking for a Java algorithm that given a String and an integer representing a number of buckets returns which bucket to place the String into.
Long version - I need to distribute a large number of objects into bins, evenly (or approximately evenly). The number of bins/buckets will vary, so the algorithm can't assume a particular number of bins. It may be 1, 30, or 200. The key for these objects will be a String.
The String has some predictable qualities that are important. The first 2 characters of the string actually appear to be a hex representation of a byte, i.e. 00-ff, and the strings themselves are quite evenly distributed within that range. There are a couple of outliers that start differently though, so this can't be relied on 100% (though easily 99.999%). This just means that edge cases do need to be handled.
It's critical that once all the strings have been distributed, there is zero overlap in range between the values that appear in any 2 bins, so that if I know what range of values appears in a bin, I don't have to look in any other bins to find the object. So for example, if I had 2 bins, it could be that bin 0 has Strings starting with letters a-m and bin 1 those starting with n-z. However, that wouldn't satisfy the need for even distribution given what we know about the Strings.
Lastly, the implementation can have no knowledge of the current state of the bins. The method signature should literally be:
public int determineBucketIndex(String key, int numBuckets);
I believe that the foreknowledge about the distribution of the Strings should be sufficient.
EDIT: Clarifying for some questions
Number of buckets can exceed 256. The strings do contain additional characters after the first 2, so this can be leveraged.
The buckets should hold a range of Strings to enable fast lookup later. In fact, that's why they're being binned to begin with. With only the knowledge of ranges, I should be able to look in exactly 1 bucket to see if the value is there or not. I shouldn't have to look in others.
Hashcodes won't work. I need the buckets to contain only String within a certain range of the String value (not the hash). Hashing would lose that.
EDIT 2: Apparently not communicating well.
After bins have been chosen, these values are written out to files, 1 file per bin. The system that uses these files after binning is NOT Java. It's already implemented, and it needs values in the bins that fit within a range. I repeat, hashcode will not work: I explicitly said the ranges for strings cannot overlap between two bins, so using hashcode cannot work.
I have read through your question twice and I still don't understand the constraints. Therefore, I am making a suggestion here and you can give feedback on it. If this won't work, please explain why.
First, do some math on the number of bins to determine how many bits you need for a unique bin number. Take the logarithm to base 2 of the number of bins and round it up to get the number of bits, then take the ceiling of the number of bits divided by 8. This is the number of bytes of data you need, numBytes.
Take the first two letters and convert them to a byte. Then grab numBytes - 1 characters and convert them to bytes. Take the ordinal value of the character ('A' becomes 65, and so on). If the next characters could be Unicode, pick some rule to convert them to bytes... probably grab the least significant byte (modulus by 256). Get numBytes bytes total, including the byte made from the first two letters, and convert to an integer. Make the byte from the first two letters the least significant 8 bits of the integer, the next byte the next 8 significant bits, and so on. Now simply take the modulus of this value by the number of bins, and you have an integer bin number.
If the string is too short and there are no more characters to turn into byte values, use 0 for each missing character.
If there are any predictable characters (for example, the third character is always a space) then don't use those characters; skip past them.
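A Java transcription of the above (a sketch, using the signature required by the question; the parse fallback covers the rare non-hex outliers):

public static int determineBucketIndex(String key, int numBuckets) {
    // Bits needed for a unique bin number, rounded up to whole bytes.
    int bits = 32 - Integer.numberOfLeadingZeros(Math.max(numBuckets - 1, 1));
    int numBytes = Math.max((bits + 7) / 8, 1);

    long value;
    try {
        value = Integer.parseInt(key.substring(0, 2), 16);  // first two chars as a hex byte
    } catch (RuntimeException outlier) {
        value = 0;                                          // rare non-hex outliers
    }
    // The hex byte forms the least significant 8 bits; each following character
    // contributes the next 8 bits (0 if the key is too short).
    for (int i = 1; i < numBytes; i++) {
        int c = (i + 1 < key.length()) ? (key.charAt(i + 1) & 0xFF) : 0;
        value |= ((long) c) << (8 * i);
    }
    return (int) (value % numBuckets);
}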
Now, if this doesn't work for you, please explain why, and then maybe we will understand the question well enough to answer it.
answer edited after 2 updates to original post
It would have been an excellent idea to include all the information in your question from the start. With your new edits, your description already gives you the answer: stick your objects into a balanced tree (giving you the homogeneous distribution you say you need) based on the hashCode of your string's substring(0,2) or something similarly head-based. Then write each leaf (being a set of strings) in the BTree to file.
I seriously doubt that the problem, as described, can be done perfectly. How about this:
Create 257 bins.
Put all normal Strings into bins 0-255.
Put all the outliers into bin 256.
Other than the "even distribution", doesn't this meet all your requirements?
At this point, if you really want more even distribution, you could reorganize bins 0-255 into a smaller number of more evenly distributed bins. But I think you may just have to lessen the requirements there.

External shuffle: shuffling large amount of data out of memory

I am looking for a way to shuffle a large amount of data which does not fit into memory (approx. 40GB).
I have around 30 million entries, of variable length, stored in one large file. I know the starting and ending positions of each entry in that file. I need to shuffle this data, which does not fit in the RAM.
The only solution I thought of is to shuffle an array containing the numbers from 1 to N, where N is the number of entries, with the Fisher-Yates algorithm, and then copy the entries to a new file according to this order. Unfortunately, this solution involves a lot of seek operations, and thus would be very slow.
Is there a better solution to shuffle a large amount of data with uniform distribution?
First, get the shuffle issue out of the way. Do this by inventing a hash algorithm for your entries that produces random-like results, then do a normal external sort on the hash.
Now that you have transformed your shuffle into a sort, your problem turns into finding an efficient external sort algorithm that fits your disk and memory limits. That should now be as easy as a Google search.
A simple approach is to pick a K such that 1/K of the data fits comfortably in memory. Perhaps K=4 for your data, assuming you've got 16GB RAM. I'll assume your random number function has the form rnd(n) which generates a uniform random number from 0 to n-1.
Then:
for i = 0 .. K-1:
    Initialize your random number generator to a known state.
    Read through the input data, generating a random number rnd(K) for each item as you go.
    Retain items in memory whenever rnd(K) == i.
    After you've read the input file, shuffle the retained data in memory.
    Write the shuffled retained items to the output file.
This is very easy to implement, will avoid a lot of seeking, and is clearly correct.
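A minimal Java sketch of this loop (one record per line and the file handling are assumptions; the fixed seed is what lets every pass regenerate the same rnd(K) stream):

import java.io.*;
import java.util.*;

public class ExternalShuffle {
    public static void shuffle(File in, File out, int k, long seed) throws IOException {
        try (PrintWriter w = new PrintWriter(new BufferedWriter(new FileWriter(out)))) {
            for (int pass = 0; pass < k; pass++) {
                Random rnd = new Random(seed);            // same known state every pass
                List<String> retained = new ArrayList<>();
                try (BufferedReader r = new BufferedReader(new FileReader(in))) {
                    for (String line; (line = r.readLine()) != null; ) {
                        if (rnd.nextInt(k) == pass) {     // retain roughly 1/K of the items
                            retained.add(line);
                        }
                    }
                }
                Collections.shuffle(retained);            // in-memory shuffle of this slice
                for (String line : retained) {
                    w.println(line);
                }
            }
        }
    }
}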
An alternative is to partition the input data into K files based on the random numbers, and then go through each, shuffling in memory and writing to disk. This reduces disk IO (each item is read twice and written twice, compared to the first approach where each item is read K times and written once), but you need to be careful to buffer the IO to avoid a lot of seeking, it uses more intermediate disk, and is somewhat more difficult to implement. If you've got only 40GB of data (so K is small), then the simple approach of multiple iterations through the input data is probably best.
If you use 20ms as the time for reading or writing 1MB of data (and assuming the in-memory shuffling cost is insignificant), the simple approach will take 40*1024*(K+1)*20ms, which is about 1 hour 8 minutes (assuming K=4). The intermediate-file approach will take 40*1024*4*20ms, which is around 55 minutes, assuming you can minimize seeking. Note that SSD is approximately 20 times faster for reads and writes (even ignoring seeking), so you should expect to perform this task in well under 10 minutes using an SSD. Numbers from Latency Numbers every Programmer should know
I suggest keeping your general approach, but inverting the map before doing the actual copy. That way, you read sequentially and do scattered writes rather than the other way round.
A read has to be done when requested before the program can continue. A write can be left in a buffer, increasing the probability of accumulating more than one write to the same disk block before actually doing the write.
Premise
From what I understand, using the Fisher-Yates algorithm and the data you have about the positions of the entries, you should be able to obtain (and compute) a list of:
struct Entry {
    long long sourceStartIndex;
    long long sourceEndIndex;
    long long destinationStartIndex;
    long long destinationEndIndex;
};
Problem
From this point onward, the naive solution is to seek each entry in the source file, read it, then seek to the new position of the entry in the destination file and write it.
The problem with this approach is that it uses way too many seeks.
Solution
A better way to do it is to reduce the number of seeks by using two huge buffers, one for each of the files.
I recommend a small buffer for the source file (say 64MB) and a big one for the destination file (as big as the user can afford - say 2GB).
Initially, the destination buffer will be mapped to the first 2GB of the destination file. At this point, read the whole source file, in chunks of 64MB, into the source buffer. As you read it, copy the proper entries to the destination buffer. When you reach the end of the file, the output buffer should contain all the data that belongs in this first part of the destination file. Write it to the destination file.
Next, map the output buffer to the next 2GB of the destination file and repeat the procedure. Continue until you have written the whole output file.
Caution
Since the entries have arbitrary sizes, it's very likely that at the beginning and ending of the buffers you will have suffixes and prefixes of entries, so you need to make sure you copy the data properly!
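Here is a sketch of that windowed copy (Entry fields as in the struct above, with the destination end implied by the length; this is my illustration, not tested production code). Sorting by source position keeps the reads mostly sequential within each pass, and the clipping handles the partial prefixes and suffixes:

import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.Arrays;
import java.util.Comparator;

class WindowedCopy {
    static class Entry { long srcStart, srcEnd, dstStart; }

    static void copy(RandomAccessFile src, RandomAccessFile dst, Entry[] entries,
                     long totalSize, int windowSize) throws IOException {
        Arrays.sort(entries, Comparator.comparingLong(e -> e.srcStart));
        byte[] window = new byte[windowSize];
        for (long winStart = 0; winStart < totalSize; winStart += windowSize) {
            long winEnd = Math.min(winStart + windowSize, totalSize);
            for (Entry e : entries) {                       // one pass over the entries per window
                long dstEnd = e.dstStart + (e.srcEnd - e.srcStart);
                if (dstEnd <= winStart || e.dstStart >= winEnd) continue;
                long from = Math.max(e.dstStart, winStart); // clip to the current window
                long to = Math.min(dstEnd, winEnd);
                src.seek(e.srcStart + (from - e.dstStart));
                src.readFully(window, (int) (from - winStart), (int) (to - from));
            }
            dst.seek(winStart);
            dst.write(window, 0, (int) (winEnd - winStart));
        }
    }
}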
Estimated time costs
The execution time depends, essentially, on the size of the source file, the available RAM for the application, and the reading speed of the HDD. Assuming a 40GB file, 2GB of RAM, and a 200MB/s HDD read speed, the program will need to read 800GB of data (40GB * (40GB / 2GB)). Assuming the HDD is not highly fragmented, the time spent on seeks will be negligible. This means the reads alone will take about an hour! But if, luckily, the user has 8GB of RAM available for your application, the time may decrease to only 15 to 20 minutes.
I hope this will be enough for you, as I don't see any other faster way.
Although you can use external sort on a random key, as proposed by OldCurmudgeon, the random key is not necessary. You can shuffle blocks of data in memory, and then join them with a "random merge," as suggested by aldel.
It's worth specifying what "random merge" means more clearly. Given two shuffled sequences of equal size, a random merge behaves exactly as in merge sort, with the exception that the next item to be added to the merged list is chosen using a boolean value from a shuffled sequence of zeros and ones, with exactly as many zeros as ones. (In merge sort, the choice would be made using a comparison.)
Proving it
My assertion that this works isn't enough. How do we know this process gives a shuffled sequence, such that every ordering is equally possible? It's possible to give a proof sketch with a diagram and a few calculations.
First, definitions. Suppose we have N unique items, where N is an even number, and M = N / 2. The N items are given to us in two M-item sequences labeled 0 and 1 that are guaranteed to be in a random order. The process of merging them produces a sequence of N items, such that each item comes from sequence 0 or sequence 1, and the same number of items come from each sequence. It will look something like this:
0: a b c d
1: w x y z
N: a w x b y c d z
Note that although the items in 0 and 1 appear to be in order, they are just labels here, and the order doesn't mean anything. It just serves to connect the order of 0 and 1 to the order of N.
Since we can tell from the labels which sequence each item came from, we can create a "source" sequence of zeros and ones. Call that c.
c: 0 1 1 0 1 0 0 1
By the definitions above, there will always be exactly as many zeros as ones in c.
Now observe that for any given ordering of labels in N, we can reproduce a c sequence directly, because the labels preserve information about the sequence they came from. And given N and c, we can reproduce the 0 and 1 sequences. So we know there's always one path back from a sequence N to one triple (0, 1, c). In other words, we have a reverse function r defined from the set of all orderings of N labels to triples (0, 1, c) -- r(N) = (0, 1, c).
We also have a forward function f from any triple (0, 1, c) that simply re-merges 0 and 1 according to the values of c. Together, these two functions show that there is a one-to-one correspondence between outputs of r(N) and orderings of N.
But what we really want to prove is that this one-to-one correspondence is exhaustive -- that is, we want to prove that there aren't extra orderings of N that don't correspond to any triple, and that there aren't extra triples that don't correspond to any ordering of N. If we can prove that, then we can choose orderings of N in a uniformly random way by choosing triples (0, 1, c) in a uniformly random way.
We can complete this last part of the proof by counting bins. Suppose every possible triple gets a bin. Then we drop every ordering of N in the bin for the triple that r(N) gives us. If there are exactly as many bins as orderings, then we have an exhaustive one-to-one correspondence.
From combinatorics, we know that the number of orderings of N unique labels is N!. We also know that the numbers of orderings of 0 and 1 are both M!. And we know that the number of possible sequences c is N choose M, which is the same as N! / (M! * (N - M)!).
This means there are a total of
M! * M! * N! / (M! * (N - M)!)
triples. But N = 2 * M, so N - M = M, and the above reduces to
M! * M! * N! / (M! * M!)
That's just N!. QED.
Implementation
To pick triples in a uniformly random way, we must pick each element of the triple in a uniformly random way. For 0 and 1, we accomplish that using a straightforward Fisher-Yates shuffle in memory. The only remaining obstacle is generating a proper sequence of zeros and ones.
It's important -- important! -- to generate only sequences with equal numbers of zeros and ones. Otherwise, you haven't chosen from among Choose(N, M) sequences with uniform probability, and your shuffle may be biased. The really obvious way to do this is to shuffle a sequence containing an equal number of zeros and ones... but the whole premise of the question is that we can't fit that many zeros and ones in memory! So we need a way to generate random sequences of zeros and ones that are constrained such that there are exactly as many zeros as ones.
To do this in a way that is probabilistically coherent, we can simulate drawing balls labeled zero or one from an urn, without replacement. Suppose we start with fifty 0 balls and fifty 1 balls. If we keep count of the number of each kind of ball in the urn, we can maintain a running probability of choosing one or the other, so that the final result isn't biased. The (suspiciously Python-like) pseudocode would be something like this:
def generate_choices(N, M):
    n0 = M
    n1 = N - M
    while n0 + n1 > 0:
        if randrange(0, n0 + n1) < n0:
            yield 0
            n0 -= 1
        else:
            yield 1
            n1 -= 1
This might not be perfect because of floating point errors, but it will be pretty close to perfect.
This last part of the algorithm is crucial. Going through the above proof exhaustively makes it clear that other ways of generating ones and zeros won't give us a proper shuffle.
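For what it's worth, here is a Java port of the urn sampler (my sketch; with integer arithmetic like this there are no floating point errors at all):

import java.util.concurrent.ThreadLocalRandom;

final class ChoiceSequence {           // emits exactly M zeros and N - M ones, uniformly
    private long n0, n1;

    ChoiceSequence(long n, long m) { n0 = m; n1 = n - m; }

    boolean hasNext() { return n0 + n1 > 0; }

    int next() {                       // draw a labeled ball from the urn, without replacement
        if (ThreadLocalRandom.current().nextLong(n0 + n1) < n0) { n0--; return 0; }
        n1--;
        return 1;
    }
}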
Performing multiple merges in real data
There remain a few practical issues. The above argument assumes a perfectly balanced merge, and it also assumes you have only twice as much data as you have memory. Neither assumption is likely to hold.
The first turns out not to be a big problem, because the above argument doesn't actually require equally sized lists. It's just that if the list sizes are different, the calculations are a little more complex. If you go through the above replacing the M for list 1 with N - M throughout, the details all line up the same way. (The pseudocode is also written in a way that works for any M greater than zero and less than N. There will then be exactly M zeros and N - M ones.)
The second means that in practice, there might be many, many chunks to merge this way. The process inherits several properties of merge sort — in particular, it requires that for K chunks, you'll have to perform roughly K / 2 merges, and then K / 4 merges, and so on, until all the data has been merged. Each batch of merges will loop over the entire dataset, and there will be roughly log2(K) batches, for a run time of O(N * log(K)). An ordinary Fisher-Yates shuffle would be strictly linear in N, and so in theory would be faster for very large K. But until K gets very, very large, the penalty may be much smaller than the disk seeking penalties.
The benefit of this approach, then, comes from smart IO management. And with SSDs it might not even be worth it; the seek penalties might not be large enough to justify the overhead of multiple merges. Paul Hankin's answer has some tips for thinking through the practical issues raised.
Merging all data at once
An alternative to doing multiple binary merges would be to merge all the chunks at once -- which is theoretically possible, and might lead to an O(N) algorithm. The random number generation algorithm for values in c would need to generate labels from 0 to K - 1, such that the final outputs have exactly the right number of labels for each category. (In other words, if you're merging three chunks with 10, 12, and 13 items, then the final value of c would need to have 0 ten times, 1 twelve times, and 2 thirteen times.)
I think there is probably an O(N) time, O(1) space algorithm that will do that, and if I can find one or work one out, I'll post it here. The result would be a truly O(N) shuffle, much like the one Paul Hankin describes towards the end of his answer.
Logically partition your database entries (e.g., alphabetically).
Create indexes based on your created partitions.
Build a DAO to sensitize based on the index.

FastSineTransformer - pad array with zeros to fit length

I'm trying to implement a Poisson solver for image blending in Java. After discretization with the 5-star method, the real work begins.
To do that I do these three steps with the color values:
using sine transformation on rows and columns
multiply eigenvalues
using inverse sine transformation on rows and columns
This works so far.
To do the sine transformation in Java, I'm using the Apache Commons Math package.
But the FastSineTransformer has two limitations:
first value in the array must be zero (well that's ok, number two is the real problem)
the length of the input must be a power of two
So right now my excerpts are of length 127, 255, and so on to fit in (I'm inserting a zero at the beginning, so that limitations 1 and 2 are fulfilled). That's pretty stupid, because I want to choose the size of my excerpt freely.
My Question is:
Is there a way to extend my array e.g. of length 100 to fit the limitations of the Apache FastSineTransformer?
In the FastFourierTransformer class it is mentioned that you can pad with zeros to get a power of two. But when I do that, I get wrong results. Perhaps I'm doing it wrong, but I really don't know if there is anything I have to keep in mind when I'm padding with zeros.
As far as I can tell from http://books.google.de/books?id=cOA-vwKIffkC&lpg=PP1&hl=de&pg=PA73#v=onepage&q&f=false and the sources at http://grepcode.com/file/repo1.maven.org/maven2/org.apache.commons/commons-math3/3.2/org/apache/commons/math3/transform/FastSineTransformer.java?av=f, the rules are as follows:
According to the implementation, the dataset size should be a power of 2, presumably in order for the algorithm to guarantee O(n*log(n)) execution time.
According to James S. Walker, the function must be odd; that is, the assumptions mentioned must be fulfilled, and the implementation trusts that they are.
According to the implementation, the first element and the middle element of the odd extension must be 0:
x'[0] = x[0] = 0,
x'[k] = x[k] if 1 <= k < N,
x'[N] = 0,
x'[k] = -x[2N-k] if N + 1 <= k < 2N.
As for your case, where the dataset size may not be a power of two, I suggest that you resize and pad the gaps with zeroes without violating the rules above. But I suggest referring to the book first.
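A sketch of the padding against the Commons Math 3 API is below. Be aware that zero-padding changes the function you are transforming, so the rest of the solver (e.g. the eigenvalues you multiply by) must be adjusted to the padded length, which may be why you saw wrong results:

import org.apache.commons.math3.transform.DstNormalization;
import org.apache.commons.math3.transform.FastSineTransformer;
import org.apache.commons.math3.transform.TransformType;

public class PaddedDst {
    public static void main(String[] args) {
        double[] data = new double[100];               // your length-100 excerpt
        data[0] = 0.0;                                 // limitation 1: first value is zero
        double[] padded = new double[128];             // limitation 2: next power of two
        System.arraycopy(data, 0, padded, 0, data.length); // the tail stays zero

        FastSineTransformer fst =
            new FastSineTransformer(DstNormalization.STANDARD_DST_I);
        double[] transformed = fst.transform(padded, TransformType.FORWARD);
        System.out.println(transformed.length);        // 128
    }
}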

Convert string to a large integer?

I have an assignment (I think a pretty common one) where the goal is to develop a LargeInteger class that can do calculations with... very large integers.
I am obviously not allowed to use the java.math.BigInteger class at all.
Right off the top I am stuck. I need to take 2 Strings from the user (the long digits) and then I will be using these strings to perform the various calculation methods (add, divide, multiply, etc.)
Can anyone explain to me the theory behind how this is supposed to work? After I take the string from the user (since it is too large to store in an int), am I supposed to break it up, maybe into 10-digit blocks of long numbers (I think 10 is the max for long, maybe 9?)
Any help is appreciated.
First off, think about what a convenient data structure to store the number would be. Think about how you would store an N digit number into an int[] array.
Now let's take addition for example. How would you go about adding two N digit numbers?
Using our grade-school addition, first we look at the least significant digit (in standard notation, this would be the right-most digit) of both numbers. Then add them up.
So if the right-most digits were 7 and 8, we would obtain 15. Take the right-most digit of this result (5) and that's the least significant digit of the answer. The 1 is carried over to the next calculation. So now we look at the 2nd least significant digit and add those together along with the carry (if there is no carry, it is 0). And repeat until there are no digits left to add.
The basic idea is to translate how you add, multiply, etc by hand into code when the numbers are stored in some data structure.
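For instance, a sketch of the addition step, assuming one decimal digit per int[] element stored least significant digit first:

static int[] add(int[] a, int[] b) {
    int n = Math.max(a.length, b.length);
    int[] sum = new int[n + 1];               // +1 for a possible final carry
    int carry = 0;
    for (int i = 0; i < n; i++) {
        int da = (i < a.length) ? a[i] : 0;
        int db = (i < b.length) ? b[i] : 0;
        int s = da + db + carry;
        sum[i] = s % 10;                      // the digit that stays in this column
        carry = s / 10;                       // carried over to the next column
    }
    sum[n] = carry;
    return sum;
}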
I'll give you a few pointers as to what I might do with a similar task, but let you figure out the details.
Look at how addition is done in simple electronic adder circuits. Specifically, they use small blocks of addition combined together; these principles will help. You can add the blocks, just remember to carry over from one block to the next.
Your idea of breaking it up into smaller blocks is an excellent one. Just remember to do the correct conversions. I suspect 9 digits is just about right, for the purpose of carry-overs, etc.
These tips will help you with addition and subtraction. Multiplication and division are a bit trickier, but again, a few tips:
Multiplication is the easier of the tasks, just remember to multiply each block of one number with the other, and carry the zeros.
Integer division could basically be approached like long division, only using whole blocks at a time.
I've never actually built such a class, so hopefully there will be something in here you can use.
Look at the source code for MPI 1.8.6 by Michael Bromberger (a C library). It uses a simple data structure for bignums and simple algorithms. It's C, not Java, but straightforward.
Its division performs poorly (and results in slow conversion of very large bignums to text), but you can follow the code.
There is a function mpi_read_radix to read a number in an arbitrary radix (up to base 36, where the letter Z is 35) with an optional leading +/- sign, and produce a bignum.
I recently chose that code for a programming language interpreter because although it is not the fastest performer out there, nor the most complete, it is very hackable. I've been able to rewrite the square root myself to a faster version, fix some coding bugs affecting a port to 64 bit digits, and add some missing operations that I needed. Plus the licensing is BSD compatible.

compress floating point numbers with specified range and precision

In my application I'm going to use floating point values to store geographical coordinates (latitude and longitude).
I know that the integer part of these values will be in the ranges [-90, 90] and [-180, 180] respectively. Also I have a requirement to enforce some fixed precision on these values (for now it is 0.00001, but it can be changed later).
After studying single precision floating point type (float) I can see that it is just a little bit small to contain my values. That's because 180 * 10^5 is greater than 2^24 (size of the significand of float) but less than 2^25.
So I have to use double. But the problem is that I'm going to store huge amounts of these values, so I don't want to waste bytes storing unnecessary precision.
So how can I perform some sort of compression when converting my double value (with fixed integer part range and specified precision X) to a byte array in Java? For example, if I use the precision from my example (0.00001), I end up with 5 bytes for each value.
I'm looking for a lightweight algorithm or solution so that it doesn't imply huge calculations.
To store a number x to a fixed precision of (for instance) 0.00001, just store the integer closest to 100000 * x. (By the way, this requires 26 bits, not 25, because you need to store negative numbers too.)
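In Java that is just the following (a sketch; the factor 100000 corresponds to the 0.00001 precision):

static int encode(double coordinate) {
    return (int) Math.round(coordinate * 100000.0);  // 26 bits incl. sign for |x| <= 180
}

static double decode(int stored) {
    return stored / 100000.0;
}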
As TonyK said in his answer, use an int to store the numbers.
To compress the numbers further, use locality: geo coordinates are often "clumped" (say, the outline of a city block). Use a fixed reference point (full 2x26 bits resolution) and then store offsets to the last coordinate as bytes (this gives you +/-0.00127). Alternatively, use short, which gives you a larger offset range (+/-0.32767).
Just be sure to hide the compression/decompression in a class which only offers double as outside API, so you can adjust the precision and the compression algorithm at any time.
Considering your use case, I would nonetheless use double and compress the values directly.
The reason is that strong compressors, such as 7zip, are extremely good at handling "structured" data, which an array of double is (one datum = 8 bytes; this is very regular and predictable).
Any other optimisation you may come up with "by hand" is likely to be inferior or to offer negligible advantage, while simultaneously costing you time and risks.
Note that you can still apply the "trick" of converting the double into int before compression, but I'm really unsure if it would bring you tangible benefit, while on the other hand it would seriously reduce your ability to cope with unforeseen ranges of figures in the future.
[Edit] Depending on source data, if "lower than precision level" bits are "noisy", it can be useful for the compression ratio to remove the noisy bits, either by rounding the value or even by directly applying a mask on the lowest bits (I guess this last method will not please purists, but at least you can directly select your precision level this way, while keeping available the full range of possible values).
So, to summarize, I'd suggest direct LZMA compression on your array of double.
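As a sketch of the "compress the doubles directly" route in plain Java (Deflater here is only a stand-in for a stronger compressor such as LZMA, which is not in the JDK):

import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.zip.Deflater;

static byte[] compress(double[] values) {
    ByteBuffer buf = ByteBuffer.allocate(values.length * 8);  // 8 regular bytes per double
    for (double v : values) {
        buf.putDouble(v);
    }
    Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
    deflater.setInput(buf.array());
    deflater.finish();
    byte[] out = new byte[buf.capacity() + buf.capacity() / 1000 + 64]; // worst case: incompressible
    int len = deflater.deflate(out);
    deflater.end();
    return Arrays.copyOf(out, len);
}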
