Hashing to uniformly distribute value over a large range - java

I want to devise an algorithm that takes a set of values and distributes them uniformly over a much larger range. For example, I have 1000 values and want to distribute them over a range of 2^16 values.
Also, the input values can change continuously, and I need to keep passing each input value through the hash function so that it gets distributed uniformly over my output range.
What hashing algorithm should I use for this?
I am writing the code in Java.

If you're just hashing integers, here's one way.
public class Hasho {
private static final long LARGE_PRIME = 948701839L;
private static final long LARGE_PRIME2 = 6920451961L;
public static void main(String[] args) {
for (int i = 0; i < 100; i++) {
System.out.println(i + " -> " + hash(i));
}
}
public static int hash(int i) {
// Spread out values
long scaled = (long) i * LARGE_PRIME;
// Fill in the lower bits
long shifted = scaled + LARGE_PRIME2;
// Add to the lower 32 bits the upper bits which would be lost in
// the conversion to an int.
long filled = shifted + ((shifted & 0xFFFFFFFF00000000L) >> 32);
// Pare it down to 31 bits in this case. Replace 7 with F if you
// want negative numbers or leave off the `& mask` part entirely.
int masked = (int) (filled & 0x7FFFFFFF);
return masked;
}
}
This is merely an example to show how it can be done. There is some serious math in a professional quality hash function.
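Since the question asks for an output range of 2^16, one possible alternative is multiplicative (Fibonacci) hashing: multiply by an odd constant derived from the golden ratio and keep the top 16 bits of the wrapped product. This is a different technique from the code above, offered only as a sketch; the class name `Hash16`, the method name `hash16`, and the choice of constant are assumptions made for illustration.

```java
final class Hash16 {
    // 2^32 divided by the golden ratio, as an odd 32-bit constant (assumed choice).
    private static final int GOLDEN = 0x9E3779B9;

    // Maps any int to [0, 65536) by keeping the top 16 bits of the
    // 32-bit wrapped product.
    static int hash16(int i) {
        return (i * GOLDEN) >>> 16;
    }

    public static void main(String[] args) {
        for (int i = 0; i < 10; i++) {
            System.out.println(i + " -> " + hash16(i));
        }
    }
}
```

Taking the high bits rather than the low bits matters here: the low bits of a product depend only on the low bits of the input, whereas the high bits mix in all of them.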

I'm sure this has a name, but this is what we used to do with ISAM files back in the dark ages:
Increment a number, e.g. 16001.
Reverse the string, i.e. 10061, and you have your hash.
You might want to reverse the number bitwise instead.
This produces a nice even spread. We used it with job numbers so that you could still retrieve the job fairly easily, so if you have a 'magic number' candidate this can be useful.
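A minimal sketch of both variants described above, under some assumptions: the class and method names are hypothetical, the decimal variant pads to a fixed width (so that 1600 and 16000 do not reverse to the same value), and the bitwise variant keeps the top 16 bits of the reversed word to land in a 2^16 range.

```java
import java.util.Locale;

final class ReverseHash {
    // Reverses the decimal digits of n after zero-padding to `width` digits.
    // Assumes n >= 0, n has at most `width` digits, and width <= 18.
    static long reverseDigits(int n, int width) {
        String padded = String.format(Locale.ROOT, "%0" + width + "d", n);
        return Long.parseLong(new StringBuilder(padded).reverse().toString());
    }

    // Bitwise variant: reverse all 32 bits, then keep the top 16,
    // giving a value in [0, 65536).
    static int reverseBits16(int n) {
        return Integer.reverse(n) >>> 16;
    }

    public static void main(String[] args) {
        System.out.println(reverseDigits(16001, 5));
        for (int i = 1; i <= 4; i++) {
            System.out.println(i + " -> " + reverseBits16(i));
        }
    }
}
```

With sequential inputs the bitwise variant places consecutive numbers far apart (1, 2, 3 map to 32768, 16384, 49152), which is the even spread the answer describes.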

Related

Printing PowerSet with help of bit position

After googling around for a while to find the subsets of a String, I read Wikipedia, which mentions that
.....For the whole power set of S we get:
{ } = 000 (Binary) = 0 (Decimal)
{x} = 100 = 4
{y} = 010 = 2
{z} = 001 = 1
{x, y} = 110 = 6
{x, z} = 101 = 5
{y, z} = 011 = 3
{x, y, z} = 111 = 7
Is there a way to implement this in a program and avoid a recursive algorithm that uses the string length?
What I have understood so far is that, for a String of length n, we can run from 0 to 2^n - 1 and print the characters for the bits that are on.
What I couldn't work out is how to map those on bits to the corresponding characters in the most optimized manner.
PS: I checked this thread but couldn't understand it, and it is in C++: Power set generated by bits
The idea is that a power set of a set of size n has exactly 2^n elements, exactly the same number as there are different binary numbers of length at most n.
Now all you have to do is create a mapping between the two and you don't need a recursive algorithm. Fortunately with binary numbers you have a real intuitive and natural mapping in that you just add a character at position j in the string to a subset if your loop variable has bit j set which you can easily do with getBit() I wrote there (you can inline it but for you I made a separate function for better readability).
P.S. As requested, more detailed explanation on the mapping:
If you have a recursive algorithm, your flow is given by how you traverse your data structure in the recursive calls. It is as such a very intuitive and natural way of solving many problems.
If you want to solve such a problem without recursion for whatever reason, for instance to use less time and memory, you have the difficult task of making this traversal explicit.
As we use a loop with a loop variable which assumes a certain set of values, we need to make sure to map each value of the loop variable, e.g. 42, to one element, in our case a subset of s, in a way that we have a bijective mapping, that is, we map to each subset exactly once. Because we have a set the order does not matter, so we just need whatever mapping that satisfies these requirements.
Now we look at a binary number, e.g. 42 = 32+8+2 and as such in binary with the position above:
543210
101010
We can thus map 42 to a subset as follows using the positions:
order the elements of the set s in any way you like but consistently (always the same in one program execution), we can in our case use the order in the string
add an element e_j if and only if the bit at position j is set (equal to 1).
As each number has at least one digit different from any other, we always get different subsets, and thus our mapping is injective (different input -> different output).
Our mapping is also valid, as the binary numbers we chose have at most a length equal to the size of our set, so every bit position can be assigned to an element of the set. Combined with the fact that our set of inputs has the same size (2^n) as the power set, it follows that the mapping is in fact bijective.
import java.util.HashSet;
import java.util.Set;
public class PowerSet
{
static boolean getBit(int i, int pos) {return (i & (1 << pos)) != 0;}
static Set<Set<Character>> powerSet(String s)
{
Set<Set<Character>> pow = new HashSet<>();
for(int i=0;i<(1<<s.length());i++)
{
Set<Character> subSet = new HashSet<>();
for(int j=0;j<s.length();j++)
{
if(getBit(i,j)) {subSet.add(s.charAt(j));}
}
pow.add(subSet);
}
return pow;
}
public static void main(String[] args)
{System.out.println(powerSet("xyz"));}
}
Here is an easy way to do it (pseudocode):
for(int i=0;i<2^n;i++) {
char subset[];
int k = i;
int c = 0;
while(k>0) {
if(k%2==1) {
subset.add(string[c]);
}
k = k/2;
c++;
}
print subset;
}
Explanation: the code repeatedly divides the number by 2 and takes the remainder, which yields its binary digits from the least significant bit upward. Whenever the digit at position c is 1, the character at index c of the string is added to the subset.
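The pseudocode above could be turned into runnable Java roughly as follows. This is a sketch: the class name `BitSubsets` and the choice to return the subsets as strings in a list are assumptions, and it only works for strings shorter than 31 characters because the loop counter is an int.

```java
import java.util.ArrayList;
import java.util.List;

final class BitSubsets {
    // Returns all 2^n subsets of the characters of s; the subset at index i
    // contains s.charAt(c) exactly when bit c of i is set.
    static List<String> subsets(String s) {
        int n = s.length(); // assumed to be < 31
        List<String> result = new ArrayList<>();
        for (int i = 0; i < (1 << n); i++) {
            StringBuilder subset = new StringBuilder();
            int k = i;
            int c = 0;
            while (k > 0) {
                if ((k & 1) == 1) {
                    subset.append(s.charAt(c));
                }
                k >>= 1; // same as dividing by 2
                c++;
            }
            result.add(subset.toString());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(subsets("xyz"));
    }
}
```

Note that here the first character corresponds to the least significant bit, which is the mirror image of the Wikipedia table quoted in the question; either convention yields the same power set.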

Stable mapping of an integer to a random number

I need a stable and fast one way mapping function of an integer to a random number.
By "stable" I mean that the same integer should always map to the same random number.
And by "random number" I actually mean "some number which behaves like random".
e.g.
1 -> 329423
2 -> -12398791234
3 -> -984
4 -> 42342435
...
If I had enough memory (and time) I would ideally use:
for( int i=Integer.MIN_VALUE; i<Integer.MAX_VALUE; i++ ){
map[i]=i;
}
shuffle( map );
I could use some secure hash function like MD5 or SHA, but these are too slow for my purposes and I don't need any crypto/security properties.
I only need this in one way. So I will never have to translate the random number back to its integer.
Background: (For those who want to know more)
I'm planning to use this to invalidate a complete cache over a given amount of time. The invalidation is done "randomly" on access of the cache member, with an increasing chance as time passes. I need this to be stable so that isValid( entry ) does not "flicker" and for consistent testing.
The input to this function will be the java hash of the key of the entry which typically is in the range of "1000"-"15000" (but can contain some other stuff, too) and comes in bulks.
The invalidation is done on the condition of:
elapsedTime / timeout * Integer.MAX_VALUE > abs( random( key.hashCode() ) )
EDIT: (this is too long for a comment so I put it here)
I tried gexicide's answer and it turns out this isn't random enough. Here is what I tried:
for( int i=0; i<12000; i++ ){
int hash = (""+i).hashCode();
Random rng = new Random( hash );
int random = rng.nextInt();
System.out.printf( "%05d, %08x, %08x\n", i, hash, random );
}
The output starts with:
00000, 00000030, bac2c591
00001, 00000031, babce6a4
00002, 00000032, bace836b
00003, 00000033, bac8a47e
00004, 00000034, baab49de
00005, 00000035, baa56af1
00006, 00000036, bab707b7
00007, 00000037, bab128ca
00008, 00000038, ba93ce2a
00009, 00000039, ba8def3d
00010, 0000061f, 98048199
and it goes on in this way.
I could use SecureRandom instead:
for( int i=0; i<12000; i++ ){
SecureRandom rng = new SecureRandom( (""+i).getBytes() );
int random = rng.nextInt();
System.out.printf( "%05d, %08x\n", i, random );
}
which indeed looks pretty random but this is not stable anymore and 10 times slower than the method above.
Although you never specified it as a requirement you'll probably want a full 1:1 mapping. This is because the number of possible input values is small. Any output that can occur for more than one input implies another output which can never happen at all. If you have output values which are impossible then you have a skewed distribution.
Of course, if your input is skewed then your output will be skewed anyway, and there's not much you can do about that.
Anyway; this makes it a unique int to int hash.
Simply apply a couple of trivial, independent 1:1 mapping functions until things are suitably distributed. You've already isolated one transform from the Random class, but I suggest mixing it with some other transforms like shifts and XORs to avoid individual weaknesses of different algorithms.
For example:
public static int mapInteger( int value ){
value *= 1664525;
value += 1013904223;
value ^= value >>> 12;
value ^= value << 25;
value ^= value >>> 27;
value *= 1103515245;
value += 12345;
return value;
}
If that's good enough then you can make it faster by deleting lines at random (I suggest you keep at least one multiply) until it's not good enough anymore, and then add the last deleted line back in.
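As a rough sketch of how such a mapping could plug into the invalidation condition from the question: `isExpired` and its parameter names are hypothetical, `mapInteger` is copied from the example above, and the sign bit is masked off instead of calling abs because Math.abs(Integer.MIN_VALUE) is still negative. The division is done in floating point, since integer division of elapsed time by timeout would stay at 0 until the timeout is reached.

```java
final class CacheExpiry {
    // Copied from the answer above: a chain of 1:1 transforms on an int.
    static int mapInteger(int value) {
        value *= 1664525;
        value += 1013904223;
        value ^= value >>> 12;
        value ^= value << 25;
        value ^= value >>> 27;
        value *= 1103515245;
        value += 12345;
        return value;
    }

    // Hypothetical helper: true once the elapsed fraction of the timeout
    // exceeds the stable pseudo-random threshold derived from the key hash.
    static boolean isExpired(int keyHash, long elapsedMillis, long timeoutMillis) {
        double fraction = (double) elapsedMillis / timeoutMillis;
        long limit = (long) (fraction * Integer.MAX_VALUE);
        int r = mapInteger(keyHash) & 0x7FFFFFFF; // non-negative 31-bit value
        return limit > r;
    }

    public static void main(String[] args) {
        for (int h = 1000; h < 1010; h++) {
            System.out.println(h + " expired at half time: " + isExpired(h, 500, 1000));
        }
    }
}
```

Because the threshold depends only on the key hash, repeated calls with the same elapsed time give the same answer, which is the "no flicker" property the question asks for.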
Use a Random and seed it with your number:
Random generator = new Random(i);
return generator.nextInt();
As your testing exposes, the problem with this method is that such a seed creates a very poor random number in the first iteration. To increase the quality of the result, we need to run the random generator a few times; this will fill up the state of the random generator with pseudo-random values and will increase the quality of the following values.
To make sure that the random generator spreads the values enough, use it a few times before outputting the number. This should make the resulting number more pseudo-random:
Random generator = new Random(i);
for(int k = 0; k < 5; k++) generator.nextInt();
return generator.nextInt();
Try different values, maybe 5 is enough.
The answer of gexicide is the correct (and the most simple) one. Just one note:
Running this 1,000,000 times on my system takes about 70ms. (Which is pretty fast.)
But it involves at least two object creations and feeds the GC. It would be better
if this could be done on the stack and not using object creation at all.
Looking at the sources of Random class it shows that there is some code to make
it callable multiple times and to make it threadsafe which can be removed.
So I ended up with a reimplementation in one method:
public static int mapInteger( int value ){
// initial scramble
long seed = (value ^ multiplier) & mask;
// shuffle three times. This is like calling rng.nextInt() 3 times
seed = (seed * multiplier + addend) & mask;
seed = (seed * multiplier + addend) & mask;
seed = (seed * multiplier + addend) & mask;
// fit size
return (int)(seed >>> 16);
}
(multiplier, addend and mask are some constants used by Random)
Running this 1,000,000 times gives the same result but takes only 5ms and is therefore more than 10 times faster.
BTW: This happens to be another piece of code from The Old Man - again. See Donald Knuth,
The Art of Computer Programming, Volume 2, Section 3.2.1
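For completeness, here is a self-contained sketch of the method above with the constants filled in. The values are the ones documented in the java.util.Random Javadoc for setSeed and next; the class name `StableMap` and the upper-case constant names are assumptions for this example.

```java
import java.util.Random;

final class StableMap {
    // Constants of the linear congruential generator documented for java.util.Random.
    private static final long MULTIPLIER = 0x5DEECE66DL;
    private static final long ADDEND = 0xBL;
    private static final long MASK = (1L << 48) - 1;

    static int mapInteger(int value) {
        // Initial scramble, as in Random's seeding step.
        long seed = (value ^ MULTIPLIER) & MASK;
        // Three LCG steps, like calling nextInt() three times.
        for (int k = 0; k < 3; k++) {
            seed = (seed * MULTIPLIER + ADDEND) & MASK;
        }
        // Keep the upper 32 of the 48 state bits.
        return (int) (seed >>> 16);
    }

    public static void main(String[] args) {
        // Compare against the third nextInt() of a Random seeded with the same value.
        Random rng = new Random(42);
        rng.nextInt();
        rng.nextInt();
        System.out.println(rng.nextInt() == mapInteger(42));
    }
}
```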

Bit-wise efficient uniform random number generation

I recall reading about a method for efficiently using random bits in an article on a math-oriented website, but I can't seem to get the right keywords in Google to find it anymore, and it's not in my browser history.
The gist of the problem that was being asked was to take a sequence of random numbers in the domain [domainStart, domainEnd) and efficiently use the bits of the random number sequence to project uniformly into the range [rangeStart, rangeEnd). Both the domain and the range are integers (more correctly, longs and not Z). What's an algorithm to do this?
Implementation-wise, I have a function with this signature:
long doRead(InputStream in, long rangeStart, long rangeEnd);
in is based on a CSPRNG (fed by a hardware RNG, conditioned through SecureRandom) that I am required to use; the return value must be between rangeStart and rangeEnd, but the obvious implementation of this is wasteful:
long doRead(InputStream in, long rangeStart, long rangeEnd) {
long retVal = 0;
long range = rangeEnd - rangeStart;
// Fill until we get to range
for (int i = 0; (1L << (8 * i)) < range; i++) {
long b = 0;
do {
b = in.read();
// but be sure we don't exceed range
} while(retVal + (b << (8 * i)) >= range);
retVal += b << (8 * i);
}
return retVal + rangeStart;
}
I believe this is effectively the same idea as (rand() * (max - min)) + min, only we're discarding bits that push us over max. Rather than use a modulo operator, which may incorrectly bias the results towards the lower values, we discard those bits and try again. Since hitting the CSPRNG may trigger re-seeding (which can block the InputStream), I'd like to avoid wasting random bits. Henry points out that this code biases against 0 and 256 (for the range [0, 257)); Banthar demonstrates it in an example.
First edit: Henry reminded me that summation invokes the Central Limit Theorem. I've fixed the code above to get around that problem.
Second edit: Mechanical snail suggested that I look at the source for Random.nextInt(). After reading it for a while, I realized that this problem is similar to the base conversion problem. See answer below.
Your algorithm produces biased results. Let's assume rangeStart=0 and rangeEnd=257. If the first byte is greater than 0, that will be the result. If it's 0, the result will be either 0 or 256 with 50/50 probability. So 0 and 256 are each half as likely to be chosen as any other number.
I did a simple test to confirm this:
p(0)=0.001945
p(1)=0.003827
p(2)=0.003818
...
p(254)=0.003941
p(255)=0.003817
p(256)=0.001955
I think you need to do the same as java.util.Random.nextInt and discard the whole number, instead of just the last byte.
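A minimal sketch of that suggestion, i.e. rejection sampling on the whole candidate rather than on a single byte. The class name `RejectionSampler` is an assumption, the method keeps the question's `doRead` signature, and it assumes rangeEnd - rangeStart is positive and does not overflow a long. It reads just enough bytes to cover the range, masks off the surplus bits, and starts over if the candidate falls outside the range.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

final class RejectionSampler {
    // Returns a value in [rangeStart, rangeEnd), uniform if the input bytes are.
    static long doRead(InputStream in, long rangeStart, long rangeEnd) throws IOException {
        long range = rangeEnd - rangeStart;
        if (range <= 0) {
            throw new IllegalArgumentException("empty or overflowing range");
        }
        // Bits needed to represent range - 1, and the bytes that cover them.
        int bits = 64 - Long.numberOfLeadingZeros(range - 1);
        int bytes = (bits + 7) / 8;
        long mask = (bits == 0) ? 0L : (-1L >>> (64 - bits));
        while (true) {
            long candidate = 0;
            for (int i = 0; i < bytes; i++) {
                int b = in.read();
                if (b < 0) {
                    throw new EOFException("random source exhausted");
                }
                candidate = (candidate << 8) | b;
            }
            candidate &= mask;
            // Discard the whole candidate when it is out of range.
            if (candidate < range) {
                return rangeStart + candidate;
            }
        }
    }

    public static void main(String[] args) throws IOException {
        InputStream demo = new ByteArrayInputStream(new byte[] {1, 5, 0, 7});
        System.out.println(doRead(demo, 0, 257));
    }
}
```

Because of the mask, fewer than half of the candidates are rejected on average, but each rejection still throws away whole bytes, so this trades some wasted input for an unbiased result.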
After reading the source to Random.nextInt(), I realized that this problem is similar to the base conversion problem.
Rather than converting a single symbol at a time, it would be more effective to convert blocks of input symbols at a time through an accumulator "buffer" which is large enough to represent at least one symbol in the domain and in the range. The new code looks like this:
public int[] fromStream(InputStream input, int length, int rangeLow, int rangeHigh) throws IOException {
int[] outputBuffer = new int[length];
// buffer is initially 0, so there is only 1 possible state it can be in
int numStates = 1;
long buffer = 0;
int alphaLength = rangeHigh - rangeLow;
// Fill outputBuffer from 0 to length
for (int i = 0; i < length; i++) {
// Until buffer has sufficient data filled in from input to emit one symbol in the output alphabet, fill buffer.
fill:
while(numStates < alphaLength) {
// Shift buffer by 8 (*256) to mix in new data (of 8 bits)
buffer = buffer << 8 | input.read();
// Multiply by 256, as that's the number of states that we have possibly introduced
numStates = numStates << 8;
}
// spits out least significant symbol in alphaLength
outputBuffer[i] = (int) (rangeLow + (buffer % alphaLength));
// We have consumed the least significant portion of the input.
buffer = buffer / alphaLength;
// Track the number of states we've introduced into buffer
numStates = numStates / alphaLength;
}
return outputBuffer;
}
There is a fundamental difference between converting numbers between bases and this problem, however; in order to convert between bases, I think one needs to have enough information about the number to perform the calculation - successive divisions by the target base result in remainders which are used to construct the digits in the target alphabet. In this problem, I don't really need to know all that information, as long as I'm not biasing the data, which means I can do what I did in the loop labeled "fill."

Java double and working with really small values

I have to store the product of several probability values that are really low (for example, 1E-80). Using the primitive Java double would result in zero because of underflow. I don't want the value to go to zero, because later on there will be a larger number (for example, 1E100) that will bring the values back within the range that a double can handle.
So I created a class of my own (MyDouble) that stores the base part and the exponent part separately. When doing calculations, for example multiplication, I multiply the base parts and add the exponents.
The program is fast with the primitive double type. However, when I use my own class (MyDouble) the program is really slow. I think this is because of the new objects that I have to create each time to perform simple operations, and the garbage collector has to do a lot of work when the objects are no longer needed.
My question is: is there a better way to solve this problem? If not, is there a way to speed up the program with my own class (MyDouble)?
[Note: taking the log and later taking the exponent does not solve my problem]
MyDouble class:
public class MyDouble {
public MyDouble(double base, int power){
this.base = base;
this.power = power;
}
public static MyDouble multiply(double... values) {
MyDouble returnMyDouble = new MyDouble(0);
double prodBase = 1;
int prodPower = 0;
for( double val : values) {
MyDouble ad = new MyDouble(val);
prodBase *= ad.base;
prodPower += ad.power;
}
String newBaseString = "" + prodBase;
String[] splitted = newBaseString.split("E");
double newBase = 0; int newPower = 0;
if(splitted.length == 2) {
newBase = Double.parseDouble(splitted[0]);
newPower = Integer.parseInt(splitted[1]);
} else {
newBase = Double.parseDouble(splitted[0]);
newPower = 0;
}
returnMyDouble.base = newBase;
returnMyDouble.power = newPower + prodPower;
return returnMyDouble;
}
}
The way this is solved is to work in log space---it trivialises the problem. When you say it doesn't work, can you give specific details of why? Probability underflow is a common issue in probabilistic models, and I don't think I've ever known it solved any other way.
Recall that log(a*b) is just log(a) + log(b). Similarly, log(a/b) is log(a) - log(b). I assume, since you're working with probabilities, that it's multiplication and division that are causing the underflow issues; the drawback of log space is that you need to use special routines to calculate log(a+b), which I can direct you to if this is your issue.
So the simple answer is, work in log space, and re-exponentiate at the end to get a human-readable number.
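A small sketch of what working in log space could look like. The class name `LogSpace` and both method names are made up for this example; `logSumExp` is the usual routine for computing log(a+b) from log(a) and log(b) without leaving log space.

```java
final class LogSpace {
    // log(p1 * p2 * ... * pn) computed as the sum of the individual logs.
    static double logProduct(double... values) {
        double sum = 0.0;
        for (double v : values) {
            sum += Math.log(v);
        }
        return sum;
    }

    // log(a + b) given logA = log(a) and logB = log(b).
    static double logSumExp(double logA, double logB) {
        if (logA == Double.NEGATIVE_INFINITY) return logB;
        if (logB == Double.NEGATIVE_INFINITY) return logA;
        double max = Math.max(logA, logB);
        double min = Math.min(logA, logB);
        // Factor out the larger term so the exponential cannot overflow.
        return max + Math.log1p(Math.exp(min - max));
    }

    public static void main(String[] args) {
        // The plain product underflows to 0 along the way; the log-space sum does not.
        double logP = logProduct(1e-200, 1e-200, 1e100, 1e100, 1e100, 1e100);
        System.out.println(Math.exp(logP));
    }
}
```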
You are parsing strings each time you multiply. Why not convert all values once, as a pre-calculation step, into a structure holding the real part and the exponent part, and then write algorithms for multiplication, addition, division, power and so on that work on that structure?
You could also add a flag for big/small numbers. I think you will not use both 1e100 and 1e-100 in one calculation (so you could simplify some calculations), and you could improve calculation time by handling the pairs (large, large), (small, small) and (large, small) differently.
You can use
BigDecimal bd = BigDecimal.ONE.scaleByPowerOfTen(-309)
.multiply(BigDecimal.ONE.scaleByPowerOfTen(-300))
.multiply(BigDecimal.ONE.scaleByPowerOfTen(300));
System.out.println(bd);
prints
1E-309
Or if you use a log10 scale
double d = -309 + -300 + 300;
System.out.println("1E"+d);
prints
1E-309.0
Slowness might be because of the intermediate string objects which are created in split and string concats.
Try this:
/**
* value = base * 10 ^ power.
*/
public class MyDouble {
// Threshold values to determine whether given double is too small or not.
private static final double SMALL_EPSILON = 1e-8;
private static final double SMALL_EPSILON_MULTIPLIER = 1e8;
private static final int SMALL_EPSILON_POWER = 8;
private double myBase;
private int myPower;
public MyDouble(double base, int power){
myBase = base;
myPower = power;
}
public MyDouble(double base)
{
myBase = base;
myPower = 0;
adjustPower();
}
/**
* If base value is too small, increase the base by multiplying with some number and
* decrease the power accordingly.
* <p> E.g 0.000 000 000 001 * 10^1 => 0.0001 * 10^-7
*/
private void adjustPower()
{
// Increase the base & decrease the power
// if given double value is less than threshold.
if (myBase < SMALL_EPSILON) {
myBase = myBase * SMALL_EPSILON_MULTIPLIER;
myPower -= SMALL_EPSILON_POWER;
}
}
/**
* This method multiplies given double and updates this object.
*/
public void multiply(MyDouble d)
{
myBase *= d.myBase;
myPower += d.myPower;
adjustPower();
}
/**
* This method multiplies given primitive double value with this object and update the
* base and power.
*/
public void multiply(double d)
{
multiply(new MyDouble(d));
}
@Override
public String toString()
{
return "Base:" + myBase + ", Power=" + myPower;
}
/**
* This method multiplies given double values and returns MyDouble object.
* It makes sure that too small double values do not zero out the multiplication result.
*/
public static MyDouble multiply(double...values)
{
MyDouble result = new MyDouble(1);
for (int i=0; i<values.length; i++) {
result.multiply(values[i]);
}
return result;
}
public static void main(String[] args) {
MyDouble r = MyDouble.multiply(1e-80, 1e100);
System.out.println(r);
}
}
If this is still slow for your purpose, you can modify multiply() method to directly operate on primitive double instead of creating a MyDouble object.
I'm sure this will be a good deal slower than a double, but probably a large contributing factor would be the String manipulation. Could you get rid of that and calculate the power through arithmetic instead? Even recursive or iterative arithmetic might be faster than converting to String to grab bits of the number.
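One way to compute the power through arithmetic, sketched under assumptions: the value is kept as a double mantissa times a power of two (rather than a power of ten), and Math.getExponent and Math.scalb are used to renormalise after each multiplication. The class name `ScaledDouble` and its methods are hypothetical, and a single mutable instance is reused so that no objects or strings are created per operation.

```java
final class ScaledDouble {
    private double mantissa = 1.0; // renormalised to magnitude [1, 2) after each step
    private long exponent = 0;     // represented value = mantissa * 2^exponent

    // Multiplies this value by d in place.
    void multiply(double d) {
        mantissa *= d;
        // Zero, NaN and infinity have no meaningful exponent to extract.
        if (mantissa != 0.0 && Double.isFinite(mantissa)) {
            int e = Math.getExponent(mantissa);
            mantissa = Math.scalb(mantissa, -e); // exact for normal values
            exponent += e;
        }
    }

    // Converts back to a plain double; may underflow or overflow if out of range.
    double toDouble() {
        long clamped = Math.max(Integer.MIN_VALUE, Math.min(Integer.MAX_VALUE, exponent));
        return Math.scalb(mantissa, (int) clamped);
    }

    public static void main(String[] args) {
        ScaledDouble p = new ScaledDouble();
        for (int i = 0; i < 5; i++) p.multiply(1e-80);  // a plain double would underflow here
        for (int i = 0; i < 4; i++) p.multiply(1e100);
        System.out.println(p.toDouble());
    }
}
```

Using a binary exponent avoids any decimal conversion: scaling by a power of two only changes the exponent field of the double, so normalisation does not introduce rounding error of its own.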
In a performance-heavy application, you want to find a way to store basic information in primitives. In this case, perhaps you can split the bits of a long or other variable so that a fixed portion is the base.
Then you can create custom methods that multiply long or Long values as if they were doubles. You grab the bits representing the base and the exponent, and truncate accordingly.
In some sense, you're re-inventing the wheel here, since you want byte code that efficiently performs the operation you're looking for.
edit:
If you want to stick with two variables, you can modify your code to simply take an array, which will be much lighter than objects. Additionally, you need to remove calls to any string parsing functions. Those are extremely slow.

Java convert hash to random string

I'm trying to develop a reduction function for use within a rainbow table generator.
The basic principle behind a reduction function is that it takes in a hash, performs some calculations, and returns a string of a certain length.
At the moment I'm using SHA1 hashes, and I need to return a string with a length of three. I need the string to be made up on any three random characters from:
abcdefghijklmnopqrstuvwxyz0123456789
The major problem I'm facing is that any reduction function I write, always returns strings that have already been generated. And a good reduction function will only return duplicate strings rarely.
Could anyone suggest any ideas on a way of accomplishing this? Or any suggestions at all on hash to string manipulation would be great.
Thanks in advance
Josh
So it sounds like you've got 20 digits of base 256 (the length of a SHA1 hash in bytes) that you need to map into three digits of base 36. I would simply make a BigInteger from the hash bytes, take it modulo 36^3, and return the string in base 36.
public static final BigInteger N36POW3 = new BigInteger(""+36*36*36);
public static String threeDigitBase36(byte[] bs) {
return new BigInteger(bs).mod(N36POW3).toString(36);
}
// ...
threeDigitBase36(sha1("foo")); // => "96b"
threeDigitBase36(sha1("bar")); // => "y4t"
threeDigitBase36(sha1("bas")); // => "p55"
threeDigitBase36(sha1("zip")); // => "ej8"
Of course there will be collisions, as when you map any space into a smaller one, but the entropy should be better than something even sillier than the above solution.
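Since the question asks for strings of exactly three characters, a variant that left-pads with '0' may be useful, because toString(36) drops leading zeros. This is a sketch with assumed names (`Reduce36`, `threeChars`); it also uses the sign-magnitude BigInteger constructor so the hash bytes are treated as an unsigned number, which means its outputs will generally differ from the examples above.

```java
import java.math.BigInteger;

final class Reduce36 {
    private static final BigInteger N36POW3 = BigInteger.valueOf(36L * 36 * 36);

    // Maps arbitrary hash bytes to exactly three characters from 0-9a-z.
    static String threeChars(byte[] hash) {
        // Signum 1 interprets the bytes as an unsigned big-endian magnitude.
        String s = new BigInteger(1, hash).mod(N36POW3).toString(36);
        while (s.length() < 3) {
            s = "0" + s; // restore leading zeros dropped by toString
        }
        return s;
    }

    public static void main(String[] args) {
        System.out.println(threeChars(new byte[] {0x7F}));
    }
}
```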
Applying the KISS principle:
An SHA is just a String
The JDK hashcode for String is "random enough"
Integer can render in any base
This single line of code does it:
public static String shortHash(String sha) {
return Integer.toString(sha.hashCode() & 0x7FFFFFFF, 36).substring(0, 3);
}
Note: The & 0x7FFFFFFF is to zero the sign bit (hash codes can be negative numbers, which would otherwise render with a leading minus sign).
Edit - Guaranteeing hash length
My original solution was naive - it didn't deal with the case when the int hash is less than 100 (base 36) - meaning it would print fewer than 3 chars. This code fixes that, while still keeping the value "random". It also avoids the substring() call, so performance should be better.
static int min = Integer.parseInt("100", 36);
static int range = Integer.parseInt("zzz", 36) - min + 1;
public static String shortHash(String sha) {
return Integer.toString(min + (sha.hashCode() & 0x7FFFFFFF) % range, 36);
}
This code guarantees the final hash has 3 characters by forcing it to be between 100 and zzz - the lowest and highest 3-char hash in base 36, while still making it "random".
