Shifting a Java BitSet - java

I am using a java.util.BitSet to store a dense vector of bits.
I want to implement an operation that shifts the bits right by 1, analogous to >>> on ints.
Is there a library function that shifts BitSets?
If not, is there a better way than the below?
public static void logicalRightShift(BitSet bs) {
for (int i = 0; (i = bs.nextSetBit(i)) >= 0;) {
// i is the first bit in a run of set bits.
// Set any bit to the left of the run.
if (i != 0) { bs.set(i - 1); }
// Now i is the index of the bit after the end of the run.
i = bs.nextClearBit(i); // nextClearBit never returns -1.
// Clear the last bit of the run.
bs.clear(i - 1);
// 0000111100000...
// a b
// i starts off the loop at a, and ends the loop at b.
// The mutations change the run to
// 0001111000000...
}
}

That should do the trick:
BitSet shifted = bs.get(1, bs.length());
It will give you a BitSet equal to the original one, but without the lowest bit.
EDIT:
To generalize this to n bits,
BitSet shifted = bs.get(n, Math.max(n, bs.length()));
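To see what the get-based shift does, here is a minimal sketch. The class and method names are only illustrative, and it assumes "shift right by n" means bit i + n of the input becomes bit i of the result.

```java
import java.util.BitSet;

public class BitSetGetShift {
    // Logical right shift by n: bit i of the result is bit i + n of the input.
    static BitSet shiftRight(BitSet bs, int n) {
        if (n < 0) throw new IllegalArgumentException("n must be >= 0");
        // get(from, to) copies bits [from, to) into a new BitSet starting at index 0.
        return bs.get(n, Math.max(n, bs.length()));
    }

    public static void main(String[] args) {
        BitSet bs = new BitSet();
        bs.set(0);
        bs.set(3);
        bs.set(70);
        System.out.println(shiftRight(bs, 1)); // prints {2, 69}
    }
}
```

Bit 0 falls off the end, and the original BitSet is left untouched because get returns a new object.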

An alternative which is probably more efficient would be to work with the underlying long[].
Use bitset.toLongArray() to get the underlying data. Shift those longs accordingly, then create a new BitSet via BitSet.valueOf(long[]). You'll have to be very careful shifting the underlying longs: for a right shift, the low-order bit of each long has to be carried into the high-order bit of the preceding long in the array.
This should let you use the bit shift operations native on your processor to move 64 bits at a time, as opposed to iterating through each one separately.
EDIT: Based on Louis Wasserman's comment: toLongArray() and valueOf(long[]) are only available since Java 7. I didn't realize that when I wrote this.

Please find this code block where BitSet is "left-shifted"
/**
* Shift the BitSet to left.<br>
* For example: 0b10010 (=18) => 0b100100 (=36) (equivalent to multiplying by 2)
* @param bitSet
* @return shifted bitSet
*/
public static BitSet leftShiftBitSet(BitSet bitSet) {
final long maskOfCarry = 0x8000000000000000L;
long[] aLong = bitSet.toLongArray();
boolean carry = false;
for (int i = 0; i < aLong.length; ++i) {
if (carry) {
carry = ((aLong[i] & maskOfCarry) != 0);
aLong[i] <<= 1;
++aLong[i];
} else {
carry = ((aLong[i] & maskOfCarry) != 0);
aLong[i] <<= 1;
}
}
if (carry) {
long[] tmp = new long[aLong.length + 1];
System.arraycopy(aLong, 0, tmp, 0, aLong.length);
++tmp[aLong.length];
aLong = tmp;
}
return BitSet.valueOf(aLong);
}

You can use BigInteger instead of BitSet. BigInteger already has shiftRight and shiftLeft methods.
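If the data already lives in a BitSet, a round trip through BigInteger could look like the sketch below. The helper names toBigInteger and toBitSet are made up for this example. The one thing to watch is byte order: BitSet.toByteArray() is little-endian, while the BigInteger constructor expects big-endian bytes.

```java
import java.math.BigInteger;
import java.util.BitSet;

public class BigIntegerShift {
    // Reverse the little-endian bytes of the BitSet into big-endian order.
    static BigInteger toBigInteger(BitSet bs) {
        byte[] le = bs.toByteArray();
        byte[] be = new byte[le.length];
        for (int i = 0; i < le.length; i++) {
            be[i] = le[le.length - 1 - i];
        }
        return new BigInteger(1, be); // signum 1: treat the bits as a non-negative magnitude
    }

    // Copy every set bit of a non-negative BigInteger back into a BitSet.
    static BitSet toBitSet(BigInteger value) {
        BitSet bs = new BitSet();
        for (int i = 0; i < value.bitLength(); i++) {
            if (value.testBit(i)) bs.set(i);
        }
        return bs;
    }

    public static void main(String[] args) {
        BitSet bs = new BitSet();
        bs.set(1);
        bs.set(4); // 0b10010 = 18
        BigInteger v = toBigInteger(bs);
        System.out.println(v.shiftRight(1));           // prints 9
        System.out.println(toBitSet(v.shiftRight(1))); // prints {0, 3}
        System.out.println(v.shiftLeft(1));            // prints 36
    }
}
```

BigInteger is immutable, so every shift allocates a new object; whether that matters depends on how often you shift.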

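These functions mimic the << and >>> operators, respectively, with one caveat about direction: they treat bit 0 as the leftmost digit, so shiftLeft moves bits toward lower indices and shiftRight moves them toward higher indices.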
/**
* Shifts a BitSet n digits to the left. For example, 0b0110101 with n=2 becomes 0b10101.
*
* @param bits the BitSet to shift (not modified)
* @param n the shift distance.
* @return a new, shifted BitSet
*/
public static BitSet shiftLeft(BitSet bits, int n) {
if (n < 0)
throw new IllegalArgumentException("'n' must be >= 0");
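if (n >= 64)
throw new IllegalArgumentException("'n' must be < 64");
long[] words = bits.toLongArray();
// Nothing to shift for n == 0 or an empty set. This also avoids "<< (64 - n)"
// acting as "<< 0" (shift distances are taken mod 64) and indexing an empty array.
if (n == 0 || words.length == 0)
return (BitSet) bits.clone();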
// Do the shift
for (int i = 0; i < words.length - 1; i++) {
words[i] >>>= n; // Shift current word
words[i] |= words[i + 1] << (64 - n); // Do the carry
}
words[words.length - 1] >>>= n; // shift [words.length-1] separately, since no carry
return BitSet.valueOf(words);
}
/**
* Shifts a BitSet n digits to the right. For example, 0b0110101 with n=2 becomes 0b000110101.
*
* @param bits the BitSet to shift (not modified)
* @param n the shift distance.
* @return a new, shifted BitSet
*/
public static BitSet shiftRight(BitSet bits, int n) {
if (n < 0)
throw new IllegalArgumentException("'n' must be >= 0");
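if (n >= 64)
throw new IllegalArgumentException("'n' must be < 64");
long[] words = bits.toLongArray();
// Nothing to shift for n == 0 or an empty set. This also avoids ">>> (64 - n)"
// acting as ">>> 0" (shift distances are taken mod 64) and indexing an empty array.
if (n == 0 || words.length == 0)
return (BitSet) bits.clone();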
// Expand array if there will be carry bits
if (words[words.length - 1] >>> (64 - n) > 0) {
long[] tmp = new long[words.length + 1];
System.arraycopy(words, 0, tmp, 0, words.length);
words = tmp;
}
// Do the shift
for (int i = words.length - 1; i > 0; i--) {
words[i] <<= n; // Shift current word
words[i] |= words[i - 1] >>> (64 - n); // Do the carry
}
words[0] <<= n; // shift [0] separately, since no carry
return BitSet.valueOf(words);
}

You can look at the BitSet toLongArray and the valueOf(long[]).
Basically get the long array, shift the longs and construct a new BitSet from the shifted array.
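As a rough illustration of that recipe for the question's case (a logical right shift by one), assuming Java 7 or later for toLongArray and valueOf; the class and method names are only for this sketch:

```java
import java.util.BitSet;

public class WordShift {
    // Logical right shift by one: bit i of the result is bit i + 1 of the input.
    static BitSet shiftRightByOne(BitSet bs) {
        long[] words = bs.toLongArray(); // a copy, so bs itself is not modified
        for (int i = 0; i < words.length; i++) {
            words[i] >>>= 1;
            if (i + 1 < words.length) {
                // The low bit of the next word becomes the high bit of this one.
                words[i] |= words[i + 1] << 63;
            }
        }
        return BitSet.valueOf(words);
    }

    public static void main(String[] args) {
        BitSet bs = new BitSet();
        bs.set(0);
        bs.set(1);
        bs.set(64);
        bs.set(130);
        System.out.println(shiftRightByOne(bs)); // prints {0, 63, 129}
    }
}
```

Walking the array from low to high index means words[i + 1] is still unshifted when its low bit is carried down.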

In order to achieve better performance you can extend the java.util.BitSet implementation and avoid unnecessary array copying. Here is an implementation (I've basically reused Jeff Piersol's implementation):
package first.specific.structure;
import java.lang.reflect.Field;
import java.util.BitSet;
public class BitSetMut extends BitSet {
private long[] words;
private static Field wordsField;
static {
try {
wordsField = BitSet.class.getDeclaredField("words");
wordsField.setAccessible(true);
} catch (NoSuchFieldException e) {
throw new IllegalStateException(e);
}
}
public BitSetMut(final int regLength) {
super(regLength);
try {
words = (long[]) wordsField.get(this);
} catch (IllegalAccessException e) {
throw new IllegalStateException(e);
}
}
public void shiftRight(int n) {
if (n < 0)
throw new IllegalArgumentException("'n' must be >= 0");
if (n >= 64)
throw new IllegalArgumentException("'n' must be < 64");
if (n > 0 && words.length > 0) { // n == 0 is a no-op; it would also turn ">>> (64 - n)" into ">>> 0"
ensureCapacity(n);
// Do the shift
for (int i = words.length - 1; i > 0; i--) {
words[i] <<= n; // Shift current word
words[i] |= words[i - 1] >>> (64 - n); // Do the carry
}
words[0] <<= n; // shift [0] separately, since no carry
// recalculateWordInUse() is unnecessary
}
}
private void ensureCapacity(final int n) {
if (words[words.length - 1] >>> (64 - n) != 0) { // the top n bits would be shifted out
long[] tmp = new long[words.length + 3];
System.arraycopy(words, 0, tmp, 0, words.length);
words = tmp;
try {
wordsField.set(this, tmp);
} catch (IllegalAccessException e) {
throw new IllegalStateException(e);
}
}
}
}

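With Java SE 8, it can be achieved in a more concise way (note that this shifts each 64-bit word independently, so a bit shifted out of one word is not carried into the next):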
BitSet b = new BitSet();
b.set(1, 3);
BitSet shifted = BitSet.valueOf(Arrays.stream(
b.toLongArray()).map(v -> v << 1).toArray());
I was trying to figure out how to use LongBuffer to do so but could not quite get it to work. Hopefully, someone who is familiar with low-level programming can point out a solution.
Thanks in advance!!!
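One possible way to keep the stream style while carrying bits between words is to stream over word indices instead of word values. This is only a sketch with made-up names, and it assumes a shift by exactly one position toward higher indices:

```java
import java.util.BitSet;
import java.util.stream.IntStream;

public class StreamShift {
    // Moves every bit one index higher (bit i becomes bit i + 1).
    static BitSet shiftUpByOne(BitSet b) {
        long[] w = b.toLongArray();
        // One extra output word in case the top bit of the last word carries out.
        long[] out = IntStream.rangeClosed(0, w.length)
                .mapToLong(i -> (i < w.length ? w[i] << 1 : 0L)
                        | (i > 0 ? w[i - 1] >>> 63 : 0L))
                .toArray();
        return BitSet.valueOf(out);
    }

    public static void main(String[] args) {
        BitSet b = new BitSet();
        b.set(1, 3); // bits 1 and 2
        b.set(63);
        System.out.println(shiftUpByOne(b)); // prints {2, 3, 64}
    }
}
```

Each output word is computed only from the unmodified input array, so the order in which the stream evaluates the indices does not matter.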

Related

How can I create a stream of bits from a byte array? [duplicate]

How can i iterate bits in a byte array?
You'd have to write your own implementation of Iterable<Boolean> which took an array of bytes, and then created Iterator<Boolean> values which remembered the current index into the byte array and the current index within the current byte. Then a utility method like this would come in handy:
private static Boolean isBitSet(byte b, int bit)
{
return (b & (1 << bit)) != 0;
}
(where bit ranges from 0 to 7). Each time next() was called you'd have to increment your bit index within the current byte, and increment the byte index within byte array if you reached "the 9th bit".
It's not really hard - but a bit of a pain. Let me know if you'd like a sample implementation...
public class ByteArrayBitIterable implements Iterable<Boolean> {
private final byte[] array;
public ByteArrayBitIterable(byte[] array) {
this.array = array;
}
public Iterator<Boolean> iterator() {
return new Iterator<Boolean>() {
private int bitIndex = 0;
private int arrayIndex = 0;
public boolean hasNext() {
return (arrayIndex < array.length) && (bitIndex < 8);
}
public Boolean next() {
Boolean val = (array[arrayIndex] >> (7 - bitIndex) & 1) == 1;
bitIndex++;
if (bitIndex == 8) {
bitIndex = 0;
arrayIndex++;
}
return val;
}
public void remove() {
throw new UnsupportedOperationException();
}
};
}
public static void main(String[] a) {
ByteArrayBitIterable test = new ByteArrayBitIterable(
new byte[]{(byte)0xAA, (byte)0xAA});
for (boolean b : test)
System.out.println(b);
}
}
Original:
for (int i = 0; i < byteArray.Length; i++)
{
byte b = byteArray[i];
int mask = 0x01;
for (int j = 0; j < 8; j++)
{
bool value = (b & mask) != 0;
mask <<= 1;
}
}
Or using Java idioms
for (byte b : byteArray ) {
for ( int mask = 0x01; mask != 0x100; mask <<= 1 ) {
boolean value = ( b & mask ) != 0;
}
}
An alternative would be to use a BitInputStream like the one you can find here and write code like this:
BitInputStream bin = new BitInputStream(new ByteArrayInputStream(bytes));
while(true){
int bit = bin.readBit();
// do something
}
bin.close();
(Note: Code doesn't contain EOFException or IOException handling for brevity.)
But I'd go with Jon Skeet's variant and do it on my own.
I needed some bit streaming in my application. Here you can find my BitArray implementation. It is not a real iterator pattern but you can ask for 1-32 bits from the array in a streaming way. There is also an alternate implementation called BitReader later in the file.
I know, probably not the "coolest" way to do it, but you can extract each bit with the following code.
int n = 156;
String bin = Integer.toBinaryString(n);
System.out.println(bin);
char arr[] = bin.toCharArray();
for(int i = 0; i < arr.length; ++i) {
System.out.println("Bit number " + (i + 1) + " = " + arr[i]);
}
10011100
Bit number 1 = 1
Bit number 2 = 0
Bit number 3 = 0
Bit number 4 = 1
Bit number 5 = 1
Bit number 6 = 1
Bit number 7 = 0
Bit number 8 = 0
You can iterate through the byte array, and for each byte use the bitwise operators to iterate through its bits.
Alternatively, you can use BitSet for this:
byte[] bytes=...;
BitSet bitSet=BitSet.valueOf(bytes);
for(int i=0;i<bitSet.length();i++){
boolean bit=bitSet.get(i);
//use your bit
}
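Two details of the BitSet approach are easy to miss: BitSet.valueOf(byte[]) treats the least significant bit of bytes[0] as bit 0, and bitSet.length() stops at the highest set bit, so trailing zero bits are skipped by the loop above. A small sketch that always visits all bytes.length * 8 bits (the class and method names are only illustrative):

```java
import java.util.BitSet;

public class ByteBits {
    // Returns every bit of the array; index 0 is the least significant bit of bytes[0].
    static boolean[] bits(byte[] bytes) {
        BitSet set = BitSet.valueOf(bytes);
        boolean[] out = new boolean[bytes.length * 8];
        for (int i = 0; i < out.length; i++) {
            out[i] = set.get(i); // get() returns false beyond the highest set bit
        }
        return out;
    }

    public static void main(String[] args) {
        boolean[] b = bits(new byte[] {0x01, 0x00});
        System.out.println(b.length); // prints 16
        System.out.println(b[0]);     // prints true
    }
}
```

If you need most-significant-bit-first order within each byte, as in the Iterable answer above, the index has to be remapped accordingly.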

Bit shuffling to change encoding from little endian

In this program I need to read a number from the user and change it from little-endian encoding to whatever encoding the user asks for. The encoding entered by the user is a 4-digit number that says which byte should go where after the conversion, e.g. 4321 means put the 4th byte first, followed by the 3rd, and so on. The encoding can take other forms, such as 3214.
This is my code; I would really appreciate it if someone could point out what I am missing.
import java.util.Scanner;
class encoding {
public static void main(String[] args) {
Scanner sc = new Scanner(System.in);
int n = sc.nextInt();
String byteOrder = sc.next();
long[] bitMask = { // little endian
Long.parseLong("11111111000000000000000000000000", 2),
Long.parseLong("00000000111111110000000000000000", 2),
Long.parseLong("00000000000000001111111100000000", 2),
Long.parseLong("00000000000000000000000011111111", 2)
};
int[] bytes = {
(int)(bitMask[0] & n),
(int)(bitMask[1] & n),
(int)(bitMask[2] & n),
(int)(bitMask[3] & n)
};
int result = 0;
shuffleBytes(bytes, byteOrder);
for (int i = 0; i < 4; i++) {
bytes[i] = bytes[i] << (i * 8);
result |= bytes[i];
}
System.out.println(result);
}
static void shuffleBytes(int[] bytes, String encoding) {
for (int i = 0; i < 4; i++) {
int index = Integer.parseInt(encoding.substring(i, i+1))-1;
int copy = bytes[i];
bytes[i] = bytes[index];
bytes[index] = copy;
}
}
}
Fixing your current solution
There are two problems:
1. Forgot to right-align bytes
In ...
int[] bytes = {
(int)(bitMask[0] & n),
(int)(bitMask[1] & n),
(int)(bitMask[2] & n),
(int)(bitMask[3] & n)
};
... you forgot to shift each "byte" to the right. As a result, you end up with a list of "bytes" of the form 0x……000000, 0x00……0000, 0x0000……00, 0x000000……. This is not a problem yet, but after shuffleBytes you shift each of these entries again using bytes[i] = bytes[i] << (i * 8); As a result, the relevant parts (……) end up at a completely different spot or are shifted completely out of the integer.
To fix this, shift each (int)(bitMask[…] & n) to the right:
int[] bytes = {
(int)(bitMask[0] & n) >>> (3*8), // unsigned shift: >> would sign-extend when the top bit is set
(int)(bitMask[1] & n) >>> (2*8),
(int)(bitMask[2] & n) >>> (1*8),
(int)(bitMask[3] & n) >>> (0*8)
};
2. Swapping more than once
In ...
static void shuffleBytes(int[] bytes, String encoding) {
for (int i = 0; i < 4; i++) {
int index = Integer.parseInt(encoding.substring(i, i+1))-1;
int copy = bytes[i];
bytes[i] = bytes[index];
bytes[index] = copy;
}
}
... you swap some bytes multiple times because you operate in-place. To understand what happens consider the following minimal example where we want to swap two bytes using order = "21". We inspect the variables before/after each iteration of the for loop.
The original input is bytes = {x, y} and order = "21"
We moved bytes[0] to bytes[1]. Now we have bytes = {y, x}.
But we are not finished yet. The loop continues and moves bytes[1] to bytes[0]. You assumed that bytes[1] would still be y at this point. However, because of the previous iteration this entry now holds x instead. Therefore, the result is bytes = {x, y}.
Here nothing changed, but for more entries you might also end up with something that is neither the original order nor the expected output order.
The easiest way to fix this is to write the result into a new array:
static int[] shuffleBytes(int[] bytes, String encoding) {
int[] result = new int[bytes.length];
for (int i = 0; i < 4; i++) {
int index = Integer.parseInt(encoding.substring(i, i+1))-1;
result[index] = bytes[i];
}
return result; // also adapt main() to use this return value
}
Alternative Solution
Even though you could fix your solution as described above I'm not too happy with it. Therefore, I propose this alternative solution which is cleaner, shorter, and more efficient.
import java.util.Scanner;
public class Encoding {
public static void main(String[] args) {
Scanner sc = new Scanner(System.in);
int input = sc.nextInt();
System.out.format("input = 0x%08x = %1$d%n", input);
String newOrder = sc.next();
int output = reorder(input, newOrder);
System.out.format("output = 0x%08x = %1$d%n", output);
}
/** @param newOrder permutation of "1234" */
static int reorder(int input, String newOrder) {
int output = 0;
for (char byte1Based : newOrder.toCharArray()) {
output <<= 8;
int shift = (byte1Based - '1') * 8;
output |= (input >>> shift) & 0xFF; // shift first, then mask, so the sign bit cannot leak in
}
return output;
}
}
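To make the byte numbering concrete, here is a small standalone check. It copies the reorder logic into a hypothetical ReorderDemo class, using an unsigned shift followed by a mask, and assumes byte 1 is the least significant byte of the int.

```java
public class ReorderDemo {
    // newOrder is a permutation of "1234"; byte 1 is the least significant byte.
    static int reorder(int input, String newOrder) {
        int output = 0;
        for (char c : newOrder.toCharArray()) {
            int shift = (c - '1') * 8;
            // >>> keeps the sign bit from being smeared into the extracted byte.
            output = (output << 8) | ((input >>> shift) & 0xFF);
        }
        return output;
    }

    public static void main(String[] args) {
        System.out.printf("0x%08x%n", reorder(0x11223344, "1234")); // prints 0x44332211
        System.out.printf("0x%08x%n", reorder(0x11223344, "4321")); // prints 0x11223344
        System.out.printf("0x%08x%n", reorder(0x80000001, "4321")); // prints 0x80000001
    }
}
```

The last call is the interesting one: an input with the top bit set is where a signed shift would corrupt the result.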

java - how to create and manipulate a bit array with length of 10 million bits

I just came across a problem that was easy to solve in pseudocode, but when I started coding it in Java I realized I didn't know where to start...
Here is what I need to do:
I need a bit array of size 10 million (bits) (let's call it A).
I need to be able to set the elements in this array to 1 or 0 (A[99000]=1).
I need to iterate through the 10 million elements.
The "proper" way in Java is to use the already-existing BitSet class pointed out by Hunter McMillen. If you're figuring out how a large bit-array is managed purely for the purpose of thinking through an interesting problem, then calculating the position of a bit in an array of bytes is just basic modular arithmetic.
public class BitArray {
private static final int ALL_ONES = 0xFFFFFFFF;
private static final int WORD_SIZE = 32;
private int bits[] = null;
public BitArray(int size) {
bits = new int[size / WORD_SIZE + (size % WORD_SIZE == 0 ? 0 : 1)];
}
public boolean getBit(int pos) {
return (bits[pos / WORD_SIZE] & (1 << (pos % WORD_SIZE))) != 0;
}
public void setBit(int pos, boolean b) {
int word = bits[pos / WORD_SIZE];
int posBit = 1 << (pos % WORD_SIZE);
if (b) {
word |= posBit;
} else {
word &= (ALL_ONES - posBit);
}
bits[pos / WORD_SIZE] = word;
}
}
Use BitSet (as Hunter McMillen already pointed out in a comment). You can easily get and set bits. To iterate just use a normal for loop.
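A minimal sketch of the three requirements with BitSet; the class name and the SIZE constant are just for this example:

```java
import java.util.BitSet;

public class TenMillionBits {
    static final int SIZE = 10_000_000;

    // Plain index loop over all SIZE positions.
    static int countOnes(BitSet a) {
        int ones = 0;
        for (int i = 0; i < SIZE; i++) {
            if (a.get(i)) ones++;
        }
        return ones;
    }

    public static void main(String[] args) {
        BitSet a = new BitSet(SIZE); // all bits start as 0
        a.set(99_000);               // A[99000] = 1
        a.clear(5);                  // A[5] = 0
        System.out.println(countOnes(a)); // prints 1

        // Visiting only the set bits is usually much faster on sparse data:
        for (int i = a.nextSetBit(0); i >= 0; i = a.nextSetBit(i + 1)) {
            System.out.println(i); // prints 99000
        }
    }
}
```

The size passed to the constructor is only an initial capacity; a BitSet grows on demand and does not remember a logical length, which is why the loop bound is kept in SIZE.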
Here is a more optimized implementation of phatfingers' BitArray:
class BitArray {
private static final int MASK = 63;
private final long len;
private long bits[] = null;
public BitArray(long size) {
if ((((size-1)>>6) + 1) > 2147483647) {
throw new IllegalArgumentException(
"Field size too large, max size = 137438953408");
}else if (size < 1) {
throw new IllegalArgumentException(
"Field size too small, min size = 1");
}
len = size;
bits = new long[(int) (((size-1)>>6) + 1)];
}
public boolean getBit(long pos) {
return (bits[(int)(pos>>6)] & (1L << (pos&MASK))) != 0;
}
public void setBit(long pos, boolean b) {
if (getBit(pos) != b) { bits[(int)(pos>>6)] ^= (1L << (pos&MASK)); }
}
public long getLength() {
return len;
}
}
Since we use 64-bit words, the maximum size grows to 137438953408 bits, which is roughly what fits in 16 GB of RAM. Additionally, we use masks and bit shifts instead of division and modulo operations, which reduces the computation time. The improvement is quite substantial.
byte[] A = new byte[10000000];
A[99000] = 1;
for(int i = 0; i < A.length; i++) {
//do something
}
If you really want bits, you can use boolean and let true = 1, and false = 0.
boolean[] A = new boolean[10000000];
//etc

Java: simplest integer hash

I need a quick hash function for integers:
int hash(int n) { return ...; }
Is there something that exists already in Java?
The minimal properties that I need are:
hash(n) & 1 does not appear periodic when used with a bunch of consecutive values of n.
hash(n) & 1 is approximately equally likely to be 0 or 1.
HashMap, as well as Guava's hash-based utilities, use the following method on hashCode() results to improve bit distributions and defend against weaker hash functions:
/*
* This method was written by Doug Lea with assistance from members of JCP
* JSR-166 Expert Group and released to the public domain, as explained at
* http://creativecommons.org/licenses/publicdomain
*
* As of 2010/06/11, this method is identical to the (package private) hash
* method in OpenJDK 7's java.util.HashMap class.
*/
static int smear(int hashCode) {
hashCode ^= (hashCode >>> 20) ^ (hashCode >>> 12);
return hashCode ^ (hashCode >>> 7) ^ (hashCode >>> 4);
}
So, I read this question, thought hmm this is a pretty math-y question, it's probably out of my league. Then, I ended up spending so much time thinking about it that I actually believe I've got the answer: No function can satisfy the criteria that f(n) & 1 is non-periodic for consecutive values of n.
Hopefully someone will tell me how ridiculous my reasoning is, but until then I believe it's correct.
Here goes: Any binary integer n can be represented as either 1...0 or 1...1, and only the least significant bit of that bitmap will affect the result of n & 1. Further, the next consecutive integer n + 1 will always contain the opposite least significant bit. So, clearly any series of consecutive integers will exhibit a period of 2 when passed to the function n & 1. So then, is there any function f(n) that will sufficiently distribute the series of consecutive integers such that periodicity is eliminated?
Any function f(n) = n + c fails, as c must end in either 0 or 1, so the LSB will either flip or stay the same depending on the constant chosen.
The above also eliminates subtraction for all trivial cases, but I have not taken the time to analyze the carry behavior yet, so there may be a crack here.
Any function f(n) = c*n fails, as the LSB will always be 0 if c ends in 0 and always be equal to the LSB of n if c ends in 1.
Any function f(n) = n^c fails, by similar reasoning. A power function would always have the same LSB as n.
Any function f(n) = c^n fails, for the same reason.
Division and modulus were a bit less intuitive to me, but basically, the LSB of either option ends up being determined by a subtraction (already ruled out). The modulus will also obviously have a period equal to the divisor.
Unfortunately, I don't have the rigor necessary to prove this, but I believe any combination of the above operations will ultimately fail as well. This leads me to believe that we can rule out any transcendental function, because these are implemented with polynomials (Taylor series? not a terminology guy).
Finally, I held out hope on the train ride home that counting the bits would work; however, this is actually a periodic function as well. The way I thought about it was, imagine taking the sum of the digits of any decimal number. That sum obviously would run from 0 through 9, then drop to 1, run from 1 to 10, then drop to 2... It has a period, the range just keeps shifting higher the higher we count. We can actually do the same thing for the sum of the binary digits, in which case we get something like: 0,1,1,2,2,....5,5,6,6,7,7,8,8....
Did I leave anything out?
TL;DR I don't think your question has an answer.
[SO decided to convert my "trivial answer" to comment. Trying to add little text to it to see if it can be fooled]
Unless you need the range of the hashing function to be wider...
The NumberOfSetBits function seems to vary quite a lot more than the hashCode, and as such seems more appropriate for your needs. It turns out there is already a fairly efficient algorithm on SO.
See Best algorithm to count the number of set bits in a 32-bit integer.
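In Java the set-bit count is available as Integer.bitCount, so a sketch of this suggestion is short. The class and method names are illustrative. The low bit of the count for n = 0, 1, 2, ... is the Thue-Morse sequence, which is known not to be periodic, though it is far from random, so whether it is good enough depends on the use case.

```java
public class ParityHash {
    // Candidate hash: the number of set bits in n.
    static int hash(int n) {
        return Integer.bitCount(n);
    }

    // Concatenates hash(n) & 1 for n = 0 .. count - 1.
    static String lowBits(int count) {
        StringBuilder sb = new StringBuilder();
        for (int n = 0; n < count; n++) {
            sb.append(hash(n) & 1);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(lowBits(16)); // prints 0110100110010110
    }
}
```

Within every pair (2m, 2m + 1) exactly one value has an odd bit count, so zeros and ones are evenly split over consecutive inputs.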
I did some experimentation (see test program below); computation of 2^n in Galois fields and floor(A*sin(n)) both did very well at producing a sequence of "random" bits. I tried multiplicative congruential random number generators, some algebra, and CRC (which is analogous to k*n in Galois fields), none of which did well.
The floor(A*sin(n)) approach is the simplest and quickest; the 2^n calculation in GF32 takes approximately 64 multiplies and 1024 XORs in the worst case, but the periodicity of its output bits is extremely well understood in the context of linear-feedback shift registers.
package com.example.math;
public class QuickHash {
interface Hasher
{
public int hash(int n);
}
static class MultiplicativeHasher1 implements Hasher
{
/* multiplicative random number generator
* from L'Ecuyer is x[n+1] = 1223106847 x[n] mod (2^32-5)
* http://dimsboiv.uqac.ca/Cours/C2012/8INF802_Hiv12/ref/paper/RNG/TableLecuyer.pdf
*/
final static long a = 1223106847L;
final static long m = (1L << 32)-5;
/*
* iterative step towards computing mod m
* (j*(2^32)+k) mod (2^32-5)
* = (j*(2^32-5)+j*5+k) mod (2^32-5)
* = (j*5+k) mod (2^32-5)
* repeat twice to get a number between 0 and 2^31+24
*/
private long quickmod(long x)
{
long j = x >>> 32;
long k = x & 0xffffffffL;
return j*5+k;
}
// treat n as unsigned before computation
@Override public int hash(int n) {
long h = a*(n&0xffffffffL);
long h2 = quickmod(quickmod(h));
return (int) (h2 >= m ? (h2-m) : h2);
}
@Override public String toString() { return getClass().getSimpleName(); }
}
/**
* computes (2^n) mod P where P is the polynomial in GF2
* with coefficients 2^(k+1) represented by the bits k=31:0 in "poly";
* coefficient 2^0 is always 1
*/
static class GF32Hasher implements Hasher
{
static final public GF32Hasher CRC32 = new GF32Hasher(0x82608EDB, 32);
final private int poly;
final private int ofs;
public GF32Hasher(int poly, int ofs) {
this.ofs = ofs;
this.poly = poly;
}
static private long uint(int x) { return x&0xffffffffL; }
// modulo GF2 via repeated subtraction
int mod(long n) {
long rem = n;
long q = uint(this.poly);
q = (q << 32) | (1L << 31);
long bitmask = 1L << 63;
for (int i = 0; i < 32; ++i, bitmask >>>= 1, q >>>= 1)
{
if ((rem & bitmask) != 0)
rem ^= q;
}
return (int) rem;
}
int mul(int x, int y)
{
return mod(uint(x)*uint(y));
}
int pow2(int n) {
// compute 2^n mod P using repeated squaring
int y = 1;
int x = 2;
while (n > 0)
{
if ((n&1) != 0)
y = mul(y,x);
x = mul(x,x);
n = n >>> 1;
}
return y;
}
@Override public int hash(int n) {
return pow2(n+this.ofs);
}
@Override public String toString() {
return String.format("GF32[%08x, ofs=%d]", this.poly, this.ofs);
}
}
static class QuickHasher implements Hasher
{
@Override public int hash(int n) {
return (int) ((131111L*n)^n^(1973*n)%7919);
}
@Override public String toString() { return getClass().getSimpleName(); }
}
// adapted from http://www.w3.org/TR/PNG-CRCAppendix.html
static class CRC32TableHasher implements Hasher
{
final private int table[];
static final private int polyval = 0xedb88320;
public CRC32TableHasher()
{
this.table = make_table();
}
/* Make the table for a fast CRC. */
static public int[] make_table()
{
int[] table = new int[256];
int c;
int n, k;
for (n = 0; n < 256; n++) {
c = n;
for (k = 0; k < 8; k++) {
if ((c & 1) != 0)
c = polyval ^ (c >>> 1);
else
c = c >>> 1;
}
table[n] = (int) c;
}
return table;
}
public int iterate(int state, int i)
{
return this.table[(state ^ i) & 0xff] ^ (state >>> 8);
}
@Override public int hash(int n) {
int h = -1;
h = iterate(h, n >>> 24);
h = iterate(h, n >>> 16);
h = iterate(h, n >>> 8);
h = iterate(h, n);
return h ^ -1;
}
@Override public String toString() { return getClass().getSimpleName(); }
}
static class TrigHasher implements Hasher
{
@Override public String toString() { return getClass().getSimpleName(); }
@Override public int hash(int n) {
double s = Math.sin(n);
return (int) Math.floor((1<<31)*s);
}
}
private static void test(Hasher hasher) {
System.out.println(hasher+":");
for (int i = 0; i < 64; ++i)
{
int h = hasher.hash(i);
System.out.println(String.format("%08x -> %08x %%2 = %d",
i,h,(h&1)));
}
for (int i = 0; i < 256; ++i)
{
System.out.print(hasher.hash(i) & 1);
}
System.out.println();
analyzeBits(hasher);
}
private static void analyzeBits(Hasher hasher) {
final int N = 65536;
final int maxrunlength=32;
int[][] runs = {new int[maxrunlength], new int[maxrunlength]};
int[] count = new int[2];
int prev = -1;
System.out.println("Run length test of "+N+" bits");
for (int i = 0; i < maxrunlength; ++i)
{
runs[0][i] = 0;
runs[1][i] = 0;
}
int runlength_minus1 = 0;
for (int i = 0; i < N; ++i)
{
int b = hasher.hash(i) & 0x1;
count[b]++;
if (b == prev)
++runlength_minus1;
else if (i > 0)
{
++runs[prev][runlength_minus1];
runlength_minus1 = 0;
}
prev = b;
}
++runs[prev][runlength_minus1];
System.out.println(String.format("%d zeros, %d ones", count[0], count[1]));
for (int i = 0; i < maxrunlength; ++i)
{
System.out.println(String.format("%d runs of %d zeros, %d runs of %d ones", runs[0][i], i+1, runs[1][i], i+1));
}
}
public static void main(String[] args) {
Hasher[] hashers = {
new MultiplicativeHasher1(),
GF32Hasher.CRC32,
new QuickHasher(),
new CRC32TableHasher(),
new TrigHasher()
};
for (Hasher hasher : hashers)
{
test(hasher);
}
}
}
The simplest hash for int value is the int value.
See Java Integer class
public int hashCode()
public static int hashCode(int value)
Returns:
a hash code value for this object, equal to the primitive int value represented by this Integer object.

Modifying Levenshtein Distance algorithm to not calculate all distances

I'm working on a fuzzy search implementation and, as part of it, we're using Apache's StringUtils.getLevenshteinDistance. At the moment, we're aiming for a specific maximum average response time for our fuzzy search. After various enhancements and some profiling, the place where the most time is spent is calculating the Levenshtein distance. It takes up roughly 80-90% of the total time on search strings of three letters or more.
Now, I know there are some limitations to what can be done here, but I've read in previous SO questions and on the Wikipedia page for LD that if one is willing to limit the threshold to a set maximum distance, that could help curb the time spent on the algorithm. I'm not sure how to do this exactly, though.
If we are only interested in the distance if it is smaller than a threshold k, then it suffices to compute a diagonal stripe of width 2k+1 in the matrix. In this way, the algorithm can be run in O(kl) time, where l is the length of the shortest string.[3]
Below you will see the original LD code from StringUtils. After that is my modification. I'm basically trying to calculate the distances only within a set distance of the i,j diagonal (so, in my example, two diagonals above and below the i,j diagonal). However, this can't be correct as I've done it. For example, on the highest diagonal, it's always going to choose the cell value directly above, which will be 0. If anyone could show me how to make this work as I've described, or give some general advice on how to do so, it would be greatly appreciated.
public static int getLevenshteinDistance(String s, String t) {
if (s == null || t == null) {
throw new IllegalArgumentException("Strings must not be null");
}
int n = s.length(); // length of s
int m = t.length(); // length of t
if (n == 0) {
return m;
} else if (m == 0) {
return n;
}
if (n > m) {
// swap the input strings to consume less memory
String tmp = s;
s = t;
t = tmp;
n = m;
m = t.length();
}
int p[] = new int[n+1]; //'previous' cost array, horizontally
int d[] = new int[n+1]; // cost array, horizontally
int _d[]; //placeholder to assist in swapping p and d
// indexes into strings s and t
int i; // iterates through s
int j; // iterates through t
char t_j; // jth character of t
int cost; // cost
for (i = 0; i<=n; i++) {
p[i] = i;
}
for (j = 1; j<=m; j++) {
t_j = t.charAt(j-1);
d[0] = j;
for (i=1; i<=n; i++) {
cost = s.charAt(i-1)==t_j ? 0 : 1;
// minimum of cell to the left+1, to the top+1, diagonally left and up +cost
d[i] = Math.min(Math.min(d[i-1]+1, p[i]+1), p[i-1]+cost);
}
// copy current distance counts to 'previous row' distance counts
_d = p;
p = d;
d = _d;
}
// our last action in the above loop was to switch d and p, so p now
// actually has the most recent cost counts
return p[n];
}
My modifications (only to the for loops):
for (j = 1; j<=m; j++) {
t_j = t.charAt(j-1);
d[0] = j;
int k = Math.max(j-2, 1);
for (i = k; i <= Math.min(j+2, n); i++) {
cost = s.charAt(i-1)==t_j ? 0 : 1;
// minimum of cell to the left+1, to the top+1, diagonally left and up +cost
d[i] = Math.min(Math.min(d[i-1]+1, p[i]+1), p[i-1]+cost);
}
// copy current distance counts to 'previous row' distance counts
_d = p;
p = d;
d = _d;
}
The issue with implementing the window is dealing with the value to the left of the first entry and above the last entry in each row.
One way is to start the values you initially fill in at 1 instead of 0, then just ignore any 0s that you encounter. You'll have to subtract 1 from your final answer.
Another way is to fill the entries left of first and above last with high values so the minimum check will never pick them. That's the way I chose when I had to implement it the other day:
public static int levenshtein(String s, String t, int threshold) {
int slen = s.length();
int tlen = t.length();
// swap so the smaller string is t; this reduces the memory usage
// of our buffers
if(tlen > slen) {
String stmp = s;
s = t;
t = stmp;
int itmp = slen;
slen = tlen;
tlen = itmp;
}
// p is the previous and d is the current distance array; dtmp is used in swaps
int[] p = new int[tlen + 1];
int[] d = new int[tlen + 1];
int[] dtmp;
// the values necessary for our threshold are written; the ones after
// must be filled with large integers since the tailing member of the threshold
// window in the bottom array will run min across them
int n = 0;
for(; n < Math.min(p.length, threshold + 1); ++n)
p[n] = n;
Arrays.fill(p, n, p.length, Integer.MAX_VALUE);
Arrays.fill(d, Integer.MAX_VALUE);
// this is the core of the Levenshtein edit distance algorithm
// instead of actually building the matrix, two arrays are swapped back and forth
// the threshold limits the amount of entries that need to be computed if we're
// looking for a match within a set distance
for(int row = 1; row < s.length()+1; ++row) {
char schar = s.charAt(row-1);
d[0] = row;
// set up our threshold window
int min = Math.max(1, row - threshold);
int max = Math.min(d.length, row + threshold + 1);
// since we're reusing arrays, we need to be sure to wipe the value left of the
// starting index; we don't have to worry about the value above the ending index
// as the arrays were initially filled with large integers and we progress to the right
if(min > 1)
d[min-1] = Integer.MAX_VALUE;
for(int col = min; col < max; ++col) {
if(schar == t.charAt(col-1))
d[col] = p[col-1];
else
// min of: diagonal, left, up
d[col] = Math.min(p[col-1], Math.min(d[col-1], p[col])) + 1;
}
// swap our arrays
dtmp = p;
p = d;
d = dtmp;
}
if(p[tlen] == Integer.MAX_VALUE)
return -1;
return p[tlen];
}
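For comparison, here is a compact sketch of the same stripe idea that reports -1 whenever the distance exceeds the threshold. It is not the code above, just one possible variant under the same assumptions; the class name, the bounded method and the INF sentinel are made up for this example.

```java
public class BandedLevenshtein {
    /** Returns the edit distance of s and t if it is at most k, otherwise -1. */
    static int bounded(String s, String t, int k) {
        int n = s.length(), m = t.length();
        // The length difference is a lower bound on the distance.
        if (k < 0 || Math.abs(n - m) > k) return -1;
        final int INF = Integer.MAX_VALUE / 2; // "outside the stripe", safe to add 1 to
        int[] prev = new int[m + 1];
        int[] cur = new int[m + 1];
        for (int j = 0; j <= m; j++) prev[j] = j; // row 0: distance from the empty prefix
        for (int i = 1; i <= n; i++) {
            int lo = Math.max(1, i - k); // first column inside the stripe
            int hi = Math.min(m, i + k); // last column inside the stripe
            cur[lo - 1] = (lo == 1) ? i : INF; // cell left of the stripe
            for (int j = lo; j <= hi; j++) {
                int cost = s.charAt(i - 1) == t.charAt(j - 1) ? 0 : 1;
                cur[j] = Math.min(prev[j - 1] + cost,
                        Math.min(prev[j], cur[j - 1]) + 1);
            }
            if (hi < m) cur[hi + 1] = INF; // cell the next row will read from above
            int[] tmp = prev;
            prev = cur;
            cur = tmp;
        }
        return prev[m] <= k ? prev[m] : -1;
    }

    public static void main(String[] args) {
        System.out.println(bounded("kitten", "sitting", 3)); // prints 3
        System.out.println(bounded("kitten", "sitting", 2)); // prints -1
    }
}
```

Only the two cells bordering the stripe are reset per row, so each row costs O(k) rather than O(m).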
I've written about Levenshtein automata, which are one way to do this sort of check in O(n) time before, here. The source code samples are in Python, but the explanations should be helpful, and the referenced papers provide more details.
According to "Gusfield, Dan (1997). Algorithms on strings, trees, and sequences: computer science and computational biology" (page 264) you should ignore zeros.
Here someone answers a very similar question:
Quote:
I've done it a number of times. The way I do it is with a recursive depth-first tree-walk of the game tree of possible changes. There is a budget k of changes, that I use to prune the tree. With that routine in hand, first I run it with k=0, then k=1, then k=2 until I either get a hit or I don't want to go any higher.
char* a = /* string 1 */;
char* b = /* string 2 */;
int na = strlen(a);
int nb = strlen(b);
bool walk(int ia, int ib, int k){
/* if the budget is exhausted, prune the search */
if (k < 0) return false;
/* if at end of both strings we have a match */
if (ia == na && ib == nb) return true;
/* if the first characters match, continue walking with no reduction in budget */
if (ia < na && ib < nb && a[ia] == b[ib] && walk(ia+1, ib+1, k)) return true;
/* if the first characters don't match, assume there is a 1-character replacement */
if (ia < na && ib < nb && a[ia] != b[ib] && walk(ia+1, ib+1, k-1)) return true;
/* try assuming there is an extra character in a */
if (ia < na && walk(ia+1, ib, k-1)) return true;
/* try assuming there is an extra character in b */
if (ib < nb && walk(ia, ib+1, k-1)) return true;
/* if none of those worked, I give up */
return false;
}
That is just the main part; there is more code in the original answer.
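For anyone who wants to try the same idea in Java, here is a minimal sketch of the budgeted walk together with the k=0, k=1, k=2, ... driver loop described in the quote. The class name, the method names, and the `maxK` cap are my own assumptions for illustration; they are not from the quoted answer.

```java
final class BudgetedEditSearch {
    // True if a (from index ia) can be turned into b (from index ib)
    // using at most k single-character edits.
    static boolean walk(String a, String b, int ia, int ib, int k) {
        // budget exhausted: prune this branch
        if (k < 0) return false;
        // reached the end of both strings: match
        if (ia == a.length() && ib == b.length()) return true;
        if (ia < a.length() && ib < b.length()) {
            // equal characters cost nothing, a replacement costs one edit
            int cost = a.charAt(ia) == b.charAt(ib) ? 0 : 1;
            if (walk(a, b, ia + 1, ib + 1, k - cost)) return true;
        }
        // assume an extra character in a
        if (ia < a.length() && walk(a, b, ia + 1, ib, k - 1)) return true;
        // assume an extra character in b
        if (ib < b.length() && walk(a, b, ia, ib + 1, k - 1)) return true;
        return false;
    }

    // Tries budgets 0, 1, 2, ... up to maxK (hypothetical cap) and returns
    // the first budget that succeeds, or -1 if none does.
    static int distanceUpTo(String a, String b, int maxK) {
        for (int k = 0; k <= maxK; k++) {
            if (walk(a, b, 0, 0, k)) return k;
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(distanceUpTo("kitten", "sitting", 5));
        System.out.println(distanceUpTo("elephant", "hippo", 6));
    }
}
```

Because the budgets are tried in increasing order, the first one that succeeds is the edit distance itself. The search is exponential in k, so this only makes sense when you care about small distances.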
I used the original code and placed this just before the end of the j for loop:
if (p[n] > s.length() + 5)
break;
The +5 is arbitrary, but for our purposes, if the distance is the query length plus five (or whatever number we settle upon), it doesn't really matter what is returned, because we consider the match simply too different. It does cut down on things a bit. Still, I'm pretty sure this isn't the idea that the Wiki statement was talking about; if anyone understands that better, please explain.
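A related cutoff that is safe to apply to the whole row, rather than one cell: every path through the cost table to the final cell passes through each row, and costs never decrease along a path, so once the smallest value in a row exceeds your limit the final distance must exceed it too. Here is a rough sketch of that, assuming a made-up helper `distanceWithCutoff` and a caller-chosen `limit`; neither comes from the original code.

```java
final class RowMinCutoff {
    // Hypothetical helper: two-row Levenshtein that returns -1 as soon as
    // every entry of the current row is greater than limit.
    static int distanceWithCutoff(String s, String t, int limit) {
        int[] p = new int[t.length() + 1]; // previous row
        int[] d = new int[t.length() + 1]; // current row
        for (int col = 0; col <= t.length(); col++) {
            p[col] = col;
        }
        for (int row = 1; row <= s.length(); row++) {
            d[0] = row;
            int rowMin = d[0];
            for (int col = 1; col <= t.length(); col++) {
                int cost = s.charAt(row - 1) == t.charAt(col - 1) ? 0 : 1;
                // min of: diagonal (+cost), left (+1), up (+1)
                d[col] = Math.min(p[col - 1] + cost, Math.min(d[col - 1], p[col]) + 1);
                rowMin = Math.min(rowMin, d[col]);
            }
            // no cell in this row is within the limit, so the result cannot be either
            if (rowMin > limit) {
                return -1;
            }
            int[] tmp = p;
            p = d;
            d = tmp;
        }
        return p[t.length()] <= limit ? p[t.length()] : -1;
    }
}
```

This does not reduce the work per row the way a diagonal stripe does; it only stops early for pairs that are clearly too different.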
Apache Commons Lang 3.4 has this implementation:
/**
* <p>Find the Levenshtein distance between two Strings if it's less than or equal to a given
* threshold.</p>
*
* <p>This is the number of changes needed to change one String into
* another, where each change is a single character modification (deletion,
* insertion or substitution).</p>
*
* <p>This implementation follows from Algorithms on Strings, Trees and Sequences by Dan Gusfield
* and Chas Emerick's implementation of the Levenshtein distance algorithm from
* http://www.merriampark.com/ld.htm</p>
*
* <pre>
* StringUtils.getLevenshteinDistance(null, *, *) = IllegalArgumentException
* StringUtils.getLevenshteinDistance(*, null, *) = IllegalArgumentException
* StringUtils.getLevenshteinDistance(*, *, -1) = IllegalArgumentException
* StringUtils.getLevenshteinDistance("","", 0) = 0
* StringUtils.getLevenshteinDistance("aaapppp", "", 8) = 7
* StringUtils.getLevenshteinDistance("aaapppp", "", 7) = 7
* StringUtils.getLevenshteinDistance("aaapppp", "", 6) = -1
* StringUtils.getLevenshteinDistance("elephant", "hippo", 7) = 7
* StringUtils.getLevenshteinDistance("elephant", "hippo", 6) = -1
* StringUtils.getLevenshteinDistance("hippo", "elephant", 7) = 7
* StringUtils.getLevenshteinDistance("hippo", "elephant", 6) = -1
* </pre>
*
* @param s the first String, must not be null
* @param t the second String, must not be null
* @param threshold the target threshold, must not be negative
* @return result distance, or {@code -1} if the distance would be greater than the threshold
* @throws IllegalArgumentException if either String input {@code null} or negative threshold
*/
public static int getLevenshteinDistance(CharSequence s, CharSequence t, final int threshold) {
if (s == null || t == null) {
throw new IllegalArgumentException("Strings must not be null");
}
if (threshold < 0) {
throw new IllegalArgumentException("Threshold must not be negative");
}
/*
This implementation only computes the distance if it's less than or equal to the
threshold value, returning -1 if it's greater. The advantage is performance: unbounded
distance is O(nm), but a bound of k allows us to reduce it to O(km) time by only
computing a diagonal stripe of width 2k + 1 of the cost table.
It is also possible to use this to compute the unbounded Levenshtein distance by starting
the threshold at 1 and doubling each time until the distance is found; this is O(dm), where
d is the distance.
One subtlety comes from needing to ignore entries on the border of our stripe
eg.
p[] = |#|#|#|*
d[] = *|#|#|#|
We must ignore the entry to the left of the leftmost member
We must ignore the entry above the rightmost member
Another subtlety comes from our stripe running off the matrix if the strings aren't
of the same size. Since string s is always swapped to be the shorter of the two,
the stripe will always run off to the upper right instead of the lower left of the matrix.
As a concrete example, suppose s is of length 5, t is of length 7, and our threshold is 1.
In this case we're going to walk a stripe of length 3. The matrix would look like so:
1 2 3 4 5
1 |#|#| | | |
2 |#|#|#| | |
3 | |#|#|#| |
4 | | |#|#|#|
5 | | | |#|#|
6 | | | | |#|
7 | | | | | |
Note how the stripe leads off the table as there is no possible way to turn a string of length 5
into one of length 7 in edit distance of 1.
Additionally, this implementation decreases memory usage by using two
single-dimensional arrays and swapping them back and forth instead of allocating
an entire n by m matrix. This requires a few minor changes, such as immediately returning
when it's detected that the stripe has run off the matrix and initially filling the arrays with
large values so that entries we don't compute are ignored.
See Algorithms on Strings, Trees and Sequences by Dan Gusfield for some discussion.
*/
int n = s.length(); // length of s
int m = t.length(); // length of t
// if one string is empty, the edit distance is necessarily the length of the other
if (n == 0) {
return m <= threshold ? m : -1;
} else if (m == 0) {
return n <= threshold ? n : -1;
}
if (n > m) {
// swap the two strings to consume less memory
final CharSequence tmp = s;
s = t;
t = tmp;
n = m;
m = t.length();
}
int p[] = new int[n + 1]; // 'previous' cost array, horizontally
int d[] = new int[n + 1]; // cost array, horizontally
int _d[]; // placeholder to assist in swapping p and d
// fill in starting table values
final int boundary = Math.min(n, threshold) + 1;
for (int i = 0; i < boundary; i++) {
p[i] = i;
}
// these fills ensure that the value above the rightmost entry of our
// stripe will be ignored in following loop iterations
Arrays.fill(p, boundary, p.length, Integer.MAX_VALUE);
Arrays.fill(d, Integer.MAX_VALUE);
// iterates through t
for (int j = 1; j <= m; j++) {
final char t_j = t.charAt(j - 1); // jth character of t
d[0] = j;
// compute stripe indices, constrain to array size
final int min = Math.max(1, j - threshold);
final int max = (j > Integer.MAX_VALUE - threshold) ? n : Math.min(n, j + threshold);
// the stripe may lead off of the table if s and t are of different sizes
if (min > max) {
return -1;
}
// ignore entry left of leftmost
if (min > 1) {
d[min - 1] = Integer.MAX_VALUE;
}
// iterates through [min, max] in s
for (int i = min; i <= max; i++) {
if (s.charAt(i - 1) == t_j) {
// diagonally left and up
d[i] = p[i - 1];
} else {
// 1 + minimum of cell to the left, to the top, diagonally left and up
d[i] = 1 + Math.min(Math.min(d[i - 1], p[i]), p[i - 1]);
}
}
// copy current distance counts to 'previous row' distance counts
_d = p;
p = d;
d = _d;
}
// if p[n] is greater than the threshold, there's no guarantee on it being the correct
// distance
if (p[n] <= threshold) {
return p[n];
}
return -1;
}
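The block comment in that implementation mentions getting the unbounded distance by starting the threshold at 1 and doubling it until a result comes back. A minimal sketch of that driver loop follows. To keep it self-contained it does not call Commons Lang; `bounded` is a stand-in I wrote with the same contract (the distance if it is within the threshold, otherwise -1), and both method names are assumptions.

```java
final class DoublingThreshold {
    // Stand-in for a threshold-bounded distance: returns the distance if it
    // is <= threshold, otherwise -1. For brevity it fills whole rows instead
    // of a diagonal stripe, so it does not have the O(km) running time.
    static int bounded(CharSequence s, CharSequence t, int threshold) {
        int[] p = new int[s.length() + 1];
        int[] d = new int[s.length() + 1];
        for (int i = 0; i <= s.length(); i++) {
            p[i] = i;
        }
        for (int j = 1; j <= t.length(); j++) {
            d[0] = j;
            for (int i = 1; i <= s.length(); i++) {
                int cost = s.charAt(i - 1) == t.charAt(j - 1) ? 0 : 1;
                d[i] = Math.min(p[i - 1] + cost, 1 + Math.min(d[i - 1], p[i]));
            }
            int[] tmp = p;
            p = d;
            d = tmp;
        }
        return p[s.length()] <= threshold ? p[s.length()] : -1;
    }

    // Starts at threshold 1 and doubles until the bounded call succeeds.
    static int unbounded(CharSequence s, CharSequence t) {
        // the distance can never exceed the length of the longer string
        int max = Math.max(s.length(), t.length());
        for (int k = 1; ; k = (k > max / 2) ? max : k * 2) {
            int result = bounded(s, t, k);
            if (result >= 0 || k >= max) {
                return result;
            }
        }
    }
}
```

With a real stripe-based bounded method plugged in for `bounded`, the thresholds tried grow geometrically, which is where the O(dm) figure in the comment comes from.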
