Hashcode for 3D integer coordinates with high spatial coherence - java

this is my first question on these forums : )
I'm writing a coordinate class in Java for a spatial octree voxel system. These coordinates are not floating-point coordinates; they are 4D integer indexes into the octree (3 normal dimensions X, Y, Z, and a fourth for depth into the tree). The first 3 values are all shorts, and the last dimension is a byte. In actual use right now only the first 11 bits of the shorts are used and only 3 bits of the byte, but this could be subject to change.
Now I'm trying to write a 'good' hash function for this class. The problem I'm wrestling with is that the coordinates are often going to be used in highly spatially coherent situations (hope I'm using the right terminology there). What I mean is that often a coordinate will be hashed along with its immediately adjacent neighbors and other nearby coordinates.
Is there an effective practice to cause these 'near to each other' coordinates to produce significantly different hashcodes?

You are in luck: there is a way to get decent co-ordinate encodings with high spatial coherence using something called a Z-order curve.
The trick is to interleave the bits of the different co-ordinate components. So if you have 3 8-bit co-ordinates like:
[XXXXXXXX, YYYYYYYY, ZZZZZZZZ]
Then the z-curve encoded value would be a single 24-bit value:
XYZXYZXYZXYZXYZXYZXYZXYZ
You can extend to larger numbers of bits or co-ordinates as required.
This encoding works because co-ordinates which are close in space will have differences mainly in the lower order bits. So by interleaving the co-ordinates, you get the differences focused in the lower-order bits of the encoded value.
An extra interesting property is that the lower bits describe co-ordinates within cubes of space: the lowest 3 bits address position within a 2x2x2 cube, the lowest 6 bits within a 4x4x4 cube, the lowest 9 bits within an 8x8x8 cube, etc. So this is actually a pretty ideal system for addressing co-ordinates within an octree.
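For illustration, here is a minimal Java sketch of that interleaving (often called a Morton code) turned into a hashCode(). The fields x, y, z, depth, the 11-bit masks, and the final fold from 64 bits down to an int are my own assumptions based on the question, not part of the answer above:

static long interleave3(int x, int y, int z) {
    long result = 0;
    for (int i = 0; i < 11; i++) {                      // 11 bits per axis, as in the question
        result |= (long) ((x >> i) & 1) << (3 * i);
        result |= (long) ((y >> i) & 1) << (3 * i + 1);
        result |= (long) ((z >> i) & 1) << (3 * i + 2);
    }
    return result;                                      // a 33-bit Morton code
}

@Override
public int hashCode() {
    long morton = interleave3(x & 0x7FF, y & 0x7FF, z & 0x7FF);
    long key = (morton << 3) | (depth & 0x7);           // tack the 3 depth bits onto the low end
    return (int) (key ^ (key >>> 32));                  // fold the 64-bit key into a 32-bit hash
}

Nearby coordinates then share their high bits and differ mainly in the low bits of key, which is exactly what hash-table bucketing (hash % capacity) benefits from.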

"Significantly different" really depends on what you're doing with the hash code afterwards. In some cases it will then be subject to a round-robin bucket pick by taking the hash % size where size is the size of the hash map you're using, for example. Obviously that will change over time. I'd usually use something like:
@Override
public int hashCode() {
    int hash = 23;
    hash = hash * 31 + x;
    hash = hash * 31 + y;
    hash = hash * 31 + z;
    hash = hash * 31 + depth;
    return hash;
}
(This is cribbed from Effective Java, basically.) Obviously it means that (x1, y1, z1) and (x1 + 1, y1 - 31, z1) would have the same hash code, but if you're mostly worried about very near neighbours it shouldn't be a problem.
EDIT: mikera's answer is likely to work better but be more complicated to code. I would personally try this very simple approach first, and see whether it's good enough for your actual use cases. Use progressively more effective but complicated approaches until you find one which is good enough.

Related

How to generate forests in java

I am creating a game where a landscape is generated, and all of the generation works perfectly. A week ago I created a basic 'forest' generation system, which is just a for loop that takes a chunk and places a random number of trees in random locations. But that does not give the result I would like to achieve.
Code:
// pick the number of trees for this chunk once, instead of re-rolling the bound every iteration
int treeCount = randomForTrees.nextInt(maxTreesPerChunk);
for (int t = 0; t <= treeCount; t++) {
    // generates random locations for the X, Z positions
    // the Y position is the terrain height at the X, Z coordinates
    float treeX = random.nextInt((int) (Settings.TERRAIN_VERTEX_COUNT + Settings.TERRAIN_SIZE)) + terrain.getX();
    float treeZ = random.nextInt((int) (Settings.TERRAIN_VERTEX_COUNT + Settings.TERRAIN_SIZE)) + terrain.getZ();
    float treeY = terrain.getTerrainHeightAtSpot(treeX, treeZ);
    // creates a tree entity at the generated position
    Entity tree = new Entity(TreeStaticModel, new Vector3f(treeX, treeY, treeZ), 0, random.nextInt(360), 0, 1);
    // only keep the tree if it is on land
    if (!(tree.getPosition().y <= -17)) {
        trees.add(tree);
    }
}
Result: (screenshot of the generated forest omitted)
First of all take a look at my:
simple C++ Island generator
As you can see, you can compute biomes from elevation, slope, etc. More sophisticated generators create a Voronoi map dividing your map into biome regions, assigning biome types randomly (with some rules) based on the neighbors already assigned...
Back to your question: you should place your trees more densely around some positions instead of uniformly covering a large area with sparse trees... So you need a slightly different kind of randomness distribution (like a Gaussian). See the legendary:
Understanding “randomness”
on how to get a different distribution from a uniform one...
So what you should do is get a few random locations that cover your region shape uniformly, and then generate trees with density dependent on the minimal distance to these points: the smaller the distance, the denser the tree placement.
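A rough Java sketch of that idea (uniform cluster centres, then distance-based acceptance; every name and constant below is a placeholder, not taken from the question's code):

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Pick a few uniformly distributed cluster centres, then accept candidate tree
// positions with a probability that falls off with the distance to the nearest centre.
static List<float[]> placeTrees(float size, int clusterCount, int candidates, float falloff) {
    Random rng = new Random();
    float[][] centers = new float[clusterCount][2];
    for (int i = 0; i < clusterCount; i++) {
        centers[i][0] = rng.nextFloat() * size;
        centers[i][1] = rng.nextFloat() * size;
    }
    List<float[]> trees = new ArrayList<>();
    for (int i = 0; i < candidates; i++) {
        float x = rng.nextFloat() * size;
        float z = rng.nextFloat() * size;
        float minDist = Float.MAX_VALUE;
        for (float[] c : centers) {
            float dx = x - c[0], dz = z - c[1];
            minDist = Math.min(minDist, (float) Math.sqrt(dx * dx + dz * dz));
        }
        // acceptance probability ~ exp(-distance / falloff): dense near centres, sparse far away
        if (rng.nextFloat() < Math.exp(-minDist / falloff)) {
            trees.add(new float[] { x, z });
        }
    }
    return trees;
}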
What you are looking for is a low-discrepancy sequence to generate your positions. The numbers it produces are not truly random, but they are spread very evenly over the region. This distinguishes them from ordinary (pseudo)random number generators, whose samples tend to clump together and leave gaps.
One example of such a sequence would be the Halton Sequence, and Apache Commons also has an implementation which you can use.
double[] nextVector = generator.nextVector();
In your case, using two dimensions, the resulting array also has two entries. What you still need to do is translate the points into your local coordinates by adding the central point of the square where you want to place the forest to each generated vector. Also, to increase the gap between points, you should consider scaling the vectors.
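For example, a minimal sketch using Commons Math's HaltonSequenceGenerator (the method and variable names here are placeholders):

import org.apache.commons.math3.random.HaltonSequenceGenerator;

// Scatter treeCount points over a square forest region of side `size` centred at (centerX, centerZ).
static void scatterTrees(double centerX, double centerZ, double size, int treeCount) {
    HaltonSequenceGenerator generator = new HaltonSequenceGenerator(2);   // 2 dimensions
    for (int i = 0; i < treeCount; i++) {
        double[] v = generator.nextVector();            // each component lies in [0, 1)
        double treeX = centerX + (v[0] - 0.5) * size;   // scale, then translate to the square's centre
        double treeZ = centerZ + (v[1] - 0.5) * size;
        // look up treeY from the terrain height at (treeX, treeZ), as in the question's code
    }
}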

How to compress floating point data? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 4 years ago.
I have read the research on SPDP: An Automatically Synthesized Lossless Compression Algorithm for Floating-Point Data https://userweb.cs.txstate.edu/~mb92/papers/dcc18.pdf
Now I would like to implement a program to simulate the compression of floating point data.
I do not know where to start. I have a text file with a set of real numbers inside.
I know that I have to use a mixing technique.
Better to use c or java?
I had thought about doing the XOR between the current value and the previous value. Then I count the frequency of these differences and finally I apply the Huffman algorithm.
Could it be right?
Any ideas to suggest?
According to the paper, their code was compiled with gcc/g++ 5.3.1 using the “-O3 -march=native” flags, so you can probably go with something like that. Also, this sounds like a short-running tool, which is probably a better fit for C than for Java anyway.
As for writing the algorithm, you will probably want to use the one they determined is best. In that case you'll need to read slowly and carefully what I have copied below. If there's anything you don't understand then you'll have to research further.
Carefully read the descriptions of each of the sub-algorithms (algorithmic components) and write their forward and reverse implementations; you need the reverse implementations so that you can decompress your data later.
Once you have all the sub-algorithms complete and tested, you can combine them as described into the synthesized algorithm. And also write the reversal for the synthesized algorithm.
The algorithmic components themselves are described further below.
5.1. Synthesized Algorithm
SPDP, the best-compressing four-component algorithm for our datasets in CRUSHER’s 9,400,320-entry search space is LNVs2 | DIM8 LNVs1 LZa6. Whereas there has to be a reducer component at the end, none appear in the first three positions, i.e., CRUSHER generated a three-stage data model followed by a one-stage coder. This result shows that chaining whole compression algorithms, each of which would include a reducer, is not beneficial. Also, the Cut appears after the first component, so it is important to first treat the data at word granularity and then at byte granularity to maximize the compression ratio.
The LNVs2 component at the beginning that operates at 4-byte granularity is of particular interest. It subtracts the second-previous value from the current value in the sequence and emits the residual. This enables the algorithm to handle both single- and double-precision data well. In case of 8-byte doubles, it takes the upper half of the previous double and subtracts it from the upper half of the current double. Then it does the same for the lower halves. The result is, except for a suppressed carry, the same as computing the difference sequence on 8-byte values. In case of 4-byte single-precision data, this component also computes the difference sequence, albeit using the second-to-last rather than the last value. If the values are similar, which is where difference sequences help, then the second-previous value is also similar and should yield residuals that cluster around zero as well. This observation answers our first research question. We are able to learn from the synthesized algorithm, in this case how to handle mixed single/double-precision datasets.
The DIM8 component after the Cut separates the bytes making up the single or double values such that the most significant bytes are grouped together, followed by the second most significant bytes, etc. This is likely done because the most significant bytes, which hold the exponent and top mantissa bits in IEEE 754 floating-point values, correlate more with each other than with the remaining bytes in the same value. This assumption is supported by the LNVs1 component that follows, which computes the byte-granularity difference sequence and, therefore, exploits precisely this similarity between the bytes in the same position of consecutive values. The LZa6 component compresses the resulting difference sequence. It uses n = 6 to avoid bad matches that result in zero counts being emitted, which expand rather than compress the data. The chosen high value of n indicates that bad matches are frequent, as is expected with relatively random datasets (cf. Table 1).
2.1. Algorithmic Components
The DIMn component takes a parameter n that specifies the dimensionality and groups the values accordingly. For example, a dimension of three changes the linear sequence x1, y1, z1, x2, y2, z2, x3, y3, z3 into x1, x2, x3, y1, y2, y3, z1, z2, z3. We use n = 2, 4, 8, and 12.
The LNVkn component takes two parameters. It subtracts the last nth value from the current value and emits the residual. If k = ‘s’, arithmetic subtraction is used. If k = ‘x’, bitwise subtraction (xor) is used. In both cases, we tested n = 1, 2, 3, 4, 8, 12, 16, 32, and 64. None of the above components change the size of the data blocks. The next three components are the only ones that can reduce the length of a data block, i.e., compress it.
The LZln component implements a variant of the LZ77 algorithm (Ziv, J. and A. Lempel. “A Universal Algorithm for Sequential Data Compression.” IEEE Transactions on Information Theory, Vol. 23, No. 3, pp. 337-343, 1977). It incorporates tradeoffs that make it more efficient than other LZ77 versions on hard-to-compress data and operates as follows. It uses a 32768-entry hash table to identify the l most recent prior occurrences of the current value. Then it checks whether the n values immediately preceding those locations match the n values just before the current location. If they do not, only the current value is emitted and the component advances to the next value. If the n values match, the component counts how many values following the current value match the values after that location. The length of the matching substring is emitted and the component advances by that many values. We consider n = 3, 4, 5, 6, and 7 combined with l = ‘a’, ‘b’, and ‘c’, where ‘a’ = 1, ‘b’ = 2, and ‘c’ = 4, which yields fifteen LZln components.
The │ pseudo component, called the Cut and denoted by a vertical bar, is a singleton component that converts a sequence of words into a sequence of bytes. Every algorithm produced by CRUSHER contains a Cut, which is included because it may be more effective to perform none, some, or all of the compression at byte rather than word granularity.
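To make those descriptions concrete, here is a rough Java sketch of DIMn and LNVsn operating on a byte array (my own illustration of the text above, not the authors' code; before the Cut the real components work at word rather than byte granularity):

// DIMn: regroup a linear sequence so that elements with the same index modulo n end up
// together, e.g. n = 3 turns x1,y1,z1,x2,y2,z2,x3,y3,z3 into x1,x2,x3,y1,y2,y3,z1,z2,z3.
static byte[] dim(byte[] in, int n) {
    byte[] out = new byte[in.length];
    int pos = 0;
    for (int d = 0; d < n; d++) {
        for (int i = d; i < in.length; i += n) {
            out[pos++] = in[i];
        }
    }
    return out;
}

// LNVsn: subtract the n-th previous value from the current value and emit the residual
// (the first n values have no predecessor and are passed through unchanged).
static byte[] lnvs(byte[] in, int n) {
    byte[] out = new byte[in.length];
    for (int i = 0; i < in.length; i++) {
        out[i] = (i < n) ? in[i] : (byte) (in[i] - in[i - n]);
    }
    return out;
}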
Remember that you'll need to also include the reversal of these algorithms if you want to decompress your data.
I hope this clarification helped, and best of luck!
Burtscher has several papers on floating point compression. Before jumping into SPDP you might want to try this paper https://userweb.cs.txstate.edu/~burtscher/papers/tr08.pdf. The paper has a code listing on page 7; you might just copy and paste it into a C file which you can experiment with before attempting harder algorithms.
Secondly, do not expect these FP compression algorithms to compress all floating point data. To get a good compression ratio, neighboring FP values are expected to be numerically close to each other or to exhibit some pattern that repeats itself. Burtscher uses a method called Finite Context Modeling (FCM) and differential FCM: "I have seen this pattern before; let me predict the next value and then XOR the actual and predicted values to achieve compression..."
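As a starting point along the lines the question suggests, here is a minimal Java sketch of the simplest possible predictor ("the next value equals the previous one") with XOR residuals. Real FCM/DFCM predictors in Burtscher's papers use hash-table lookups to predict, so treat this only as an illustration:

// XOR each double's bit pattern with the previous one; if consecutive values are close,
// the residuals have many leading zero bits, which a back-end coder (e.g. Huffman) can exploit.
static long[] xorResiduals(double[] values) {
    long[] residuals = new long[values.length];
    long prev = 0L;
    for (int i = 0; i < values.length; i++) {
        long bits = Double.doubleToRawLongBits(values[i]);
        residuals[i] = bits ^ prev;
        prev = bits;
    }
    return residuals;
}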

FastSineTransformer - pad array with zeros to fit length

I'm trying to implement a Poisson solver for image blending in Java. After discretization with the 5-star method, the real work begins.
To do that I perform these three steps on the color values:
apply the sine transformation to rows and columns
multiply by the eigenvalues
apply the inverse sine transformation to rows and columns
This works so far.
To do the sine transformation in Java, I'm using the Apache Commons Math package.
But the FastSineTransformer has two limitations:
the first value in the array must be zero (well, that's OK; number two is the real problem)
the length of the input must be a power of two
So right now my excerpts are of length 127, 255 and so on to fit in (I'm inserting a zero at the beginning, so that both 1 and 2 are fulfilled). That's pretty stupid, because I want to choose the size of my excerpt freely.
My Question is:
Is there a way to extend my array e.g. of length 100 to fit the limitations of the Apache FastSineTransformer?
In the FastFourierTransformer class it is mentioned that you can pad with zeros to get a power of two. But when I do that, I get wrong results. Perhaps I'm doing it wrong, but I really don't know if there is anything I have to keep in mind when I'm padding with zeros.
As far as I can tell from http://books.google.de/books?id=cOA-vwKIffkC&lpg=PP1&hl=de&pg=PA73#v=onepage&q&f=false and the sources http://grepcode.com/file/repo1.maven.org/maven2/org.apache.commons/commons-math3/3.2/org/apache/commons/math3/transform/FastSineTransformer.java?av=f, the rules are as follows:
According to the implementation, the dataset size should be a power of 2, presumably in order for the algorithm to guarantee O(n*log(n)) execution time.
According to James S. Walker, the function must be odd, that is, the assumptions below must be fulfilled, and the implementation trusts that they are.
According to the implementation, the first and the middle element (of the odd extension) must be 0:
x'[0] = x[0] = 0,
x'[k] = x[k] if 1 <= k < N,
x'[N] = 0,
x'[k] = -x[2N-k] if N + 1 <= k < 2N.
As for your case, where the dataset may not be a power of two, I suggest you resize it and pad the gaps with zeroes without violating the rules above. But I suggest referring to the book first.
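For what it's worth, here is a minimal Java sketch of the padding itself (a leading zero plus trailing zeros up to the next power of two); whether zero-padding is mathematically sound for your particular Poisson solve is a separate question that the book should answer:

import org.apache.commons.math3.transform.DstNormalization;
import org.apache.commons.math3.transform.FastSineTransformer;
import org.apache.commons.math3.transform.TransformType;

static double[] padAndTransform(double[] data) {
    int n = 1;
    while (n < data.length + 1) {        // next power of two that holds a leading zero plus the data
        n <<= 1;
    }
    double[] padded = new double[n];                     // padded[0] stays 0, as required
    System.arraycopy(data, 0, padded, 1, data.length);   // trailing entries also stay 0
    FastSineTransformer dst = new FastSineTransformer(DstNormalization.STANDARD_DST_I);
    return dst.transform(padded, TransformType.FORWARD);
}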

compress floating point numbers with specified range and precision

In my application I'm going to use floating point values to store geographical coordinates (latitude and longitude).
I know that the integer parts of these values will be in the ranges [-90, 90] and [-180, 180] respectively. Also, I have a requirement to enforce a fixed precision on these values (for now it is 0.00001, but it could change later).
After studying single precision floating point type (float) I can see that it is just a little bit small to contain my values. That's because 180 * 10^5 is greater than 2^24 (size of the significand of float) but less than 2^25.
So I have to use double. But the problem is that I'm going to store huge amounts of these values, so I don't want to waste bytes storing unnecessary precision.
So how can I perform some sort of compression when converting my double value (with fixed integer-part range and specified precision X) to a byte array in Java? For example, if I use the precision from my example (0.00001), I end up with 5 bytes for each value.
I'm looking for a lightweight algorithm or solution so that it doesn't imply huge calculations.
To store a number x to a fixed precision of (for instance) 0.00001, just store the integer closest to 100000 * x. (By the way, this requires 26 bits, not 25, because you need to store negative numbers too.)
As TonyK said in his answer, use an int to store the numbers.
To compress the numbers further, use locality: Geo coordinates are often "clumped" (say the outline of a city block). Use a fixed reference point (full 2x26 bits resolution) and then store offsets to the last coordinate as bytes (gives you +/-0.00127). Alternatively, use short which gives you more than half the value range.
Just be sure to hide the compression/decompression in a class which only offers double as outside API, so you can adjust the precision and the compression algorithm at any time.
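A minimal sketch of that idea in Java (the class name and the precision constant are just for illustration):

// Store coordinates as scaled integers; only double is visible to callers.
final class PackedCoordinate {
    private static final double PRECISION = 0.00001;  // adjust if the requirement changes
    private final int latScaled;   // |90 / 0.00001| = 9,000,000 fits in 25 bits including sign
    private final int lonScaled;   // |180 / 0.00001| = 18,000,000 fits in 26 bits including sign

    PackedCoordinate(double latitude, double longitude) {
        this.latScaled = (int) Math.round(latitude / PRECISION);
        this.lonScaled = (int) Math.round(longitude / PRECISION);
    }

    double latitude()  { return latScaled * PRECISION; }
    double longitude() { return lonScaled * PRECISION; }
}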
Considering your use case, I would nonetheless use double and compress them directly.
The reason is that strong compressors, such as 7zip, are extremely good at handling "structured" data, which an array of double is (one datum = 8 bytes; this is very regular and predictable).
Any other optimisation you may come up with "by hand" is likely to be inferior or to offer negligible advantage, while costing you time and adding risk.
Note that you can still apply the "trick" of converting the double into int before compression, but I'm really unsure if it would bring you a tangible benefit, while on the other hand it would seriously reduce your ability to cope with unforeseen ranges of figures in the future.
[Edit] Depending on the source data, if the "lower than precision level" bits are "noisy", it can be useful for the compression ratio to remove the noisy bits, either by rounding the value or even directly applying a mask on the lowest bits (I guess this last method will not please purists, but at least you can directly select your precision level this way, while keeping the full range of possible values available).
So, to summarize, I'd suggest direct LZMA compression on your array of double.

Scale numbers to be <= 255?

I have cells whose numeric value can be anything between 0 and Integer.MAX_VALUE. I would like to color-code these cells correspondingly.
If the value = 0, then r = 0. If the value is Integer.MAX_VALUE, then r = 255. But what about the values in between?
I'm thinking I need a function whose limit as x approaches Integer.MAX_VALUE is 255. What is this function? Or is there a better way to do this?
I could just do (value / (Integer.MAX_VALUE / 255)) but that will cause many low values to be zero. So perhaps I should do it with a log function.
Most of my values will be in the range [0, 10,000]. So I want to highlight the differences there.
The "fairest" linear scaling is actually done like this:
floor(256 * value / (Integer.MAX_VALUE + 1))
Note that this is just pseudocode and assumes floating-point calculations.
If we assume that Integer.MAX_VALUE + 1 is 2^31, and that / will give us integer division, then it simplifies to
value / 8388608
Why other answers are wrong
Some answers (as well as the question itself) suggested a variation of (255 * value / Integer.MAX_VALUE). Presumably this has to be converted to an integer, either using round() or floor().
If using floor(), the only value that produces 255 is Integer.MAX_VALUE itself. This distribution is uneven.
If using round(), 0 and 255 will each get hit half as many times as 1-254. Also uneven.
Using the scaling method I mention above, no such problem occurs.
Non-linear methods
If you want to use logs, try this:
255 * log(value + 1) / log(Integer.MAX_VALUE + 1)
You could also just take the square root of the value (this wouldn't go all the way to 255, but you could scale it up if you wanted to).
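A small Java sketch of both mappings above, using long and double arithmetic so that Integer.MAX_VALUE + 1 does not overflow (value is assumed to be in [0, Integer.MAX_VALUE]):

static int linearScale(int value) {
    return (int) (256L * value / (Integer.MAX_VALUE + 1L));   // 0 -> 0, MAX_VALUE -> 255
}

static int logScale(int value) {
    return (int) (255.0 * Math.log(value + 1.0) / Math.log(Integer.MAX_VALUE + 1.0));
}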
I figured a log fit would be good for this, but looking at the results, I'm not so sure.
However, Wolfram|Alpha is great for experimenting with this sort of thing:
I started with that, and ended up with:
r(x) = floor(((11.5553 * log(14.4266 * (x + 1.0))) - 30.8419) / 0.9687)
Interestingly, it turns out that this gives nearly identical results to Artelius's answer of:
r(x) = floor(255 * log(x + 1) / log(2^31 + 1))
IMHO, you'd be best served with a split function for 0-10000 and 10000-2^31.
For a linear mapping of the range 0-2^32 to 0-255, just take the high-order byte. Here is how that would look using binary & and bit-shifting:
r = (value & 0xff000000) >>> 24;
(Note the parentheses and the unsigned >>>: in Java the shift binds more tightly than &, and a signed >> of 0xff000000 would sign-extend to -1.)
Using mod 256 will certainly return a value 0-255, but you won't be able to draw any grouping sense from the results - 1, 257, 513, 1025 will all map to the scaled value 1, even though they are far from each other.
If you want to be more discriminating among low values, and merge many more large values together, then a log expression will work:
r = log(value)/log(pow(2,32))*256
EDIT: Yikes, my high school algebra teacher Mrs. Buckenmeyer would faint! log(pow(2,32)) is the same as 32*log(2), and much cheaper to evaluate. And now we can also factor this better, since 256/32 is a nice even 8:
r = 8 * log(value)/log(2)
log(value)/log(2) is actually log-base-2 of value, which log does for us very neatly:
r = 8 * log(value,2)
There, Mrs. Buckenmeyer - your efforts weren't entirely wasted!
In general (since it's not clear to me if this is a Java or Language-Agnostic question) you would divide the value you have by Integer.MAX_VALUE, multiply by 255 and convert to an integer.
This works! r = value / 8421504;
8421504 is actually the 'magic' number, which equals MAX_VALUE / 255. Thus, MAX_VALUE / 8421504 = 255 (and some change, but it is small enough that integer math gets rid of it).
If you want one that doesn't have a magic number in it, this should work (and with equal performance, since any good compiler will replace it with the actual value):
r= value/ (Integer.MAX_VALUE/255);
The nice part is, this will not require any floating-point values.
The value you're looking for is: r = 255 * (value / Integer.MAX_VALUE). So you'd have to turn this into a double, then cast back to an int.
Note that if you want things to look brighter and brighter, luminosity is not linear, so a straight mapping from value to color will not give a good result.
The Color class has a method to make a brighter color. Have a look at that.
The linear implementation is discussed in most of these answers, and Artelius' answer seems to be the best. But the best formula would depend on what you are trying to achieve and the distribution of your values. Without knowing that it is difficult to give an ideal answer.
But just to illustrate, any of these might be the best for you:
Linear distribution, each colour mapping onto a range which is 1/256th of the overall range.
Logarithmic distribution (skewed towards low values) which will highlight the differences in the lower magnitudes and diminish differences in the higher magnitudes
Reverse logarithmic distribution (skewed towards high values) which will highlight differences in the higher magnitudes and diminish differences in the lower magnitudes.
Equalised distribution of colour incidence, where each colour appears the same number of times as every other colour.
Again, you need to determine what you are trying to achieve & what the data will be used for. If you have been tasked to build this then I would strongly recommend you get this clarified to ensure that it is as useful as possible - and to avoid having to redevelop it later on.
Ask yourself the question, "What value should map to 128?"
If the answer is about a billion (I doubt that it is) then use linear.
If the answer is in the range of 10-100 thousand, then consider square root or log.
Another answer suggested this (I can't comment or vote yet). I agree.
r = log(value)/log(pow(2,32))*256
Here are a bunch of algorithms for scaling, normalizing, ranking, etc. numbers by using Extension Methods in C#, although you can adapt them to other languages:
http://www.redowlconsulting.com/Blog/post/2011/07/28/StatisticalTricksForLists.aspx
There are explanations and graphics that explain when you might want to use one method or another.
The best answer really depends on the behavior you want.
If you want each cell just to generally have a color different than the neighbor, go with what akf said in the second paragraph and use a modulo (x % 256).
If you want the color to have some bearing on the actual value (like "blue means smaller values" all the way to "red means huge values"), you would have to post something about your expected distribution of values. Since you worry about many low values being zero I might guess that you have lots of them, but that would only be a guess.
In this second scenario, you really want to distribute your likely responses into 256 "percentiles" and assign a color to each one (where an equal number of likely responses fall into each percentile).
If you are complaining that the low numbers are becoming zero, then you might want to normalize the values to 255 rather than the entire range of the values.
The formula would become:
currentValue / (max value of the set)
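For example (assuming max is the largest value actually present in the set and is greater than zero):

static int scaleToSetMax(int value, int max) {
    return (int) (255L * value / max);   // 0 -> 0, max -> 255
}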
I could just do (value / (Integer.MAX_VALUE / 255)) but that will cause many low values to be zero.
One approach you could take is to use the modulo operator (r = value%256;). Although this wouldn't ensure that Integer.MAX_VALUE turns out as 255, it would guarantee a number between 0 and 255. It would also allow for low numbers to be distributed across the 0-255 range.
EDIT:
Funnily enough, as I test this, Integer.MAX_VALUE % 256 does result in 255 (I had originally mistakenly tested against % 255, which yielded the wrong result). This seems like a pretty straightforward solution.
