Avro Message - InvalidNumberEncodingException deserializing logicalType date - java

I get an exception when I deserialize a message with a field defined as logicalType date.
As per the documentation, the field is defined as:
{"name": "startDate", "type": {"type": "int", "logicalType": "date"}}
I use "avro-maven-plugin" (1.9.2) to generate the Java classes, and I can set the field startDate to java.time.LocalDate.now(); the Avro object is serialized and the message is sent to a Kafka topic. So far, everything is good.
However, when I read the message I get the exception:
Caused by: org.apache.avro.InvalidNumberEncodingException: Invalid int encoding
at org.apache.avro.io.BinaryDecoder.readInt(BinaryDecoder.java:166)
at org.apache.avro.io.ValidatingDecoder.readInt(ValidatingDecoder.java:83)
at org.apache.avro.generic.GenericDatumReader.readInt(GenericDatumReader.java:551)
at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:195)
at org.apache.avro.generic.GenericDatumReader.readWithConversion(GenericDatumReader.java:173)
at org.apache.avro.specific.SpecificDatumReader.readField(SpecificDatumReader.java:134)
at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:247)
at org.apache.avro.specific.SpecificDatumReader.readRecord(SpecificDatumReader.java:123)
at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:179)
at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:160)
at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:153)
What makes this even weirder is that no error occurs if I set a different date like LocalDate.of(1970, 1, 1).
In other words, if the serialized int value representing the number of days since 01/01/1970 is small enough, everything works fine.
I tried that test after having a look at the code that raises the exception; it made me think that the error could be avoided if the day count is lower than 127:
public int readInt() throws IOException {
this.ensureBounds(5);
int len = 1;
int b = this.buf[this.pos] & 255;
int n = b & 127;
if (b > 127) {
b = this.buf[this.pos + len++] & 255;
n ^= (b & 127) << 7;
if (b > 127) {
b = this.buf[this.pos + len++] & 255;
n ^= (b & 127) << 14;
if (b > 127) {
b = this.buf[this.pos + len++] & 255;
n ^= (b & 127) << 21;
if (b > 127) {
b = this.buf[this.pos + len++] & 255;
n ^= (b & 127) << 28;
if (b > 127) {
throw new InvalidNumberEncodingException("Invalid int encoding");
}
}
}
}
}
....
Of course, in production I can't restrict myself to dates close to 01/01/1970.
Any help is welcome :-)

TL;DR
The code that you have posted can deserialize numbers not only up to 127 but across the full range of a Java int, so up to a couple of billion, corresponding to dates several million years after 1970.
Details
The BinaryDecoder.readInt method from Apache Avro deserializes 1 to 5 bytes into a Java int. It takes the low 7 bits of each byte for the int and leaves out the sign bit (the most significant bit of the byte). Instead, the sign bit is used for determining how many bytes to read. A sign bit of 0 means this is the last byte. A sign bit of 1 means there are more bytes after this one. The exception is thrown when 5 bytes have been read and all of them had their sign bits set to 1. 5 bytes can supply 35 bits, and an int can hold 32 bits, so regarding more than 5 bytes as an error is fair.
So from the code that you have posted no dates that I would reasonably expect to use in an application will pose any problems.
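To make the byte layout concrete, here is a minimal sketch of the same base-128 scheme written as a loop, together with a matching encoder. The names VarIntSketch, encode and decode are made up for this illustration and are not Avro API. Like the posted snippet, the sketch only deals with the raw 7-bit groups and does not cover whatever the elided tail of readInt does afterwards.

```java
import java.io.ByteArrayOutputStream;

// Illustrative only: class and method names are hypothetical, not part of Avro.
final class VarIntSketch {

    // Split n into 7-bit groups, lowest group first, and set the top bit
    // on every byte except the last one.
    static byte[] encode(int n) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((n & ~0x7F) != 0) {
            out.write((n & 0x7F) | 0x80);
            n >>>= 7;
        }
        out.write(n);
        return out.toByteArray();
    }

    // Loop form of the nested ifs in the posted readInt.
    static int decode(byte[] buf) {
        int n = 0;
        for (int i = 0; i < 5 && i < buf.length; i++) {
            int b = buf[i] & 0xFF;
            n |= (b & 0x7F) << (7 * i);
            if (b <= 0x7F) {
                return n; // top bit clear: this was the last byte
            }
        }
        throw new IllegalArgumentException("Invalid int encoding");
    }

    public static void main(String[] args) {
        byte[] bytes = encode(18782);       // epoch day of 2021-06-04
        System.out.println(bytes.length);   // 3
        System.out.println(decode(bytes));  // 18782
    }
}
```

Three bytes are enough for any day count below 2^21, which covers dates for several thousand years after 1970.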
Test code
I put your method in a TestBinaryDecoder class to try it out (full code at the end). Let’s first see how the exception comes from 5 bytes all having their sign bit set to 1:
try {
System.out.println(new TestBinaryDecoder(-1, -1, -1, -1, -1).readInt());
} catch (IOException ioe) {
System.out.println(ioe);
}
Output:
ovv.so.binary.misc.InvalidNumberEncodingException: Invalid int encoding
Also as you said, 127 poses no problem:
System.out.println(new TestBinaryDecoder(127, -1, -1, -1, -1).readInt());
127
The interesting part comes when we put more bytes in holding bits of the int that we want. Here the first byte has a sign bit of 1, the next has 0, so I expect those two bytes to be used:
System.out.println(new TestBinaryDecoder(255, 127, -1, -1, -1).readInt());
16383
We are already getting close to the number needed for today’s date. Today is 2021-06-04 in my time zone, day 18782 after the epoch, or in binary: 100100101011110. So let’s try putting those 15 binary digits into three bytes for the decoder:
int epochDay = new TestBinaryDecoder(0b11011110, 0b10010010, 0b1, -1, -1).readInt();
System.out.println(epochDay);
System.out.println(LocalDate.ofEpochDay(epochDay));
18782
2021-06-04
So how you got your exception I can’t tell. The source surely isn’t just a large int value. The problem must be somewhere else.
Full code
import java.io.IOException;
import java.util.stream.IntStream;
public class TestBinaryDecoder {
private byte[] buf;
private int pos;
/** Convenience constructor */
public TestBinaryDecoder(int... buf) {
this(toByteArray(buf));
}
private static byte[] toByteArray(int[] intArray) {
byte[] byteArray = new byte[intArray.length];
IntStream.range(0, intArray.length).forEach(ix -> byteArray[ix] = (byte) intArray[ix]);
return byteArray;
}
public TestBinaryDecoder(byte[] buf) {
this.buf = buf;
pos = 0;
}
public int readInt() throws IOException {
this.ensureBounds(5);
int len = 1;
int b = this.buf[this.pos] & 255;
int n = b & 127;
if (b > 127) {
b = this.buf[this.pos + len++] & 255;
n ^= (b & 127) << 7;
if (b > 127) {
b = this.buf[this.pos + len++] & 255;
n ^= (b & 127) << 14;
if (b > 127) {
b = this.buf[this.pos + len++] & 255;
n ^= (b & 127) << 21;
if (b > 127) {
b = this.buf[this.pos + len++] & 255;
n ^= (b & 127) << 28;
if (b > 127) {
throw new InvalidNumberEncodingException("Invalid int encoding");
}
}
}
}
}
return n;
}
private void ensureBounds(int bounds) {
System.out.println("Ensuring bounds " + bounds);
}
}
class InvalidNumberEncodingException extends IOException {
public InvalidNumberEncodingException(String message) {
super(message);
}
}

Related

CRC16 CCITT Java implementation

There is a function written in C that calculates CRC16 CCITT. It helps with reading data from an RFID reader and basically works fine. I would like to write a function in Java that does the same thing.
I tried an online converter page to do this, but the code I got is garbage.
Could you please take a look at this and advise why the Java code that should do the same generates a different CRC?
Here is the original C function:
void CRC16(unsigned char * Data, unsigned short * CRC, unsigned char Bytes)
{
int i, byte;
unsigned short C;
*CRC = 0;
for (byte = 1; byte <= Bytes; byte++, Data++)
{
C = ((*CRC >> 8) ^ *Data) << 8;
for (i = 0; i < 8; i++)
{
if (C & 0x8000)
C = (C << 1) ^ 0x1021;
else
C = C << 1;
}
*CRC = C ^ (*CRC << 8);
}
}
And here is a different CRC function written in Java that should calculate the same checksum, but does not:
public static int CRC16_CCITT_Test(byte[] buffer) {
int wCRCin = 0x0000;
int wCPoly = 0x1021;
for (byte b : buffer) {
for (int i = 0; i < 8; i++) {
boolean bit = ((b >> (7 - i) & 1) == 1);
boolean c15 = ((wCRCin >> 15 & 1) == 1);
wCRCin <<= 1;
if (c15 ^ bit)
wCRCin ^= wCPoly;
}
}
wCRCin &= 0xffff;
return wCRCin;
}
When I feed the numbers 0, 2, 3 to both functions I get different results:
for the C function it is (DEC): 22017
for the Java function it is (DEC): 28888
OK. I have converted the C code into Java and got it partially working.
public static int CRC16_Test(byte[] data, byte bytes) {
int dataIndex = 0;
short c;
short [] crc= {0};
crc[0] = (short)0;
for(int j = 1; j <= Byte.toUnsignedInt(bytes); j++, dataIndex++) {
c = (short)((Short.toUnsignedInt(crc[0]) >> 8 ^ Byte.toUnsignedInt(data[dataIndex])) << 8);
for(int i = 0; i < 8; i++) {
if((Short.toUnsignedInt(c) & 0x8000) != 0) {
c = (short)(Short.toUnsignedInt(c) << 1 ^ 0x1021);
} else {
c = (short)(Short.toUnsignedInt(c) << 1);
}
}
crc[0] = (short)(Short.toUnsignedInt(c) ^ Short.toUnsignedInt(crc[0]) << 8);
}
return crc[0];
}
It gives the same CRC values as the C code for the numbers 0, 2, 3, but e.g. for the numbers 255, 216, 228 the C code CRC is 60999 while the Java CRC is -4537.
OK. Finally thanks to your pointers I got this working.
The last change required was changing 'return crc[0]' to:
return (int) crc[0] & 0xffff;
... and it works...
Many thanks to all :)
There is nothing wrong. For a 16-bit value, -4537 is represented by exactly the same 16 bits as 60999. If you would like your routine to return the positive version, convert to int (which is 32 bits) and do an & 0xffff.
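As an alternative to juggling short values, the C routine can also be ported with plain int arithmetic, masking to 16 bits wherever the C code relies on unsigned short truncation. This is only a sketch; the class name Crc16Sketch is made up for the example.

```java
// Sketch of a direct port of the C routine; the class name is hypothetical.
final class Crc16Sketch {

    static int crc16(byte[] data) {
        int crc = 0;
        for (byte value : data) {
            // value & 0xFF undoes the sign extension of the Java byte
            int c = (((crc >> 8) ^ (value & 0xFF)) << 8) & 0xFFFF;
            for (int i = 0; i < 8; i++) {
                c = ((c & 0x8000) != 0) ? ((c << 1) ^ 0x1021) : (c << 1);
                c &= 0xFFFF; // emulate the 16-bit width of unsigned short
            }
            crc = (c ^ (crc << 8)) & 0xFFFF;
        }
        return crc;
    }

    public static void main(String[] args) {
        int crc = crc16(new byte[] {(byte) 255, (byte) 216, (byte) 228});
        System.out.println(crc);          // 60999
        System.out.println((short) crc);  // -4537, same 16 bits read as signed
    }
}
```

Because the result is kept in an int and masked, no extra conversion is needed on return.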

What's so special about 0x7f?

I'm reading avro format specification and trying to understand its implementation. Here is the method for decoding long value:
@Override
public long readLong() throws IOException {
ensureBounds(10);
int b = buf[pos++] & 0xff;
int n = b & 0x7f;
long l;
if (b > 0x7f) {
b = buf[pos++] & 0xff;
n ^= (b & 0x7f) << 7;
if (b > 0x7f) {
b = buf[pos++] & 0xff;
n ^= (b & 0x7f) << 14;
if (b > 0x7f) {
b = buf[pos++] & 0xff;
n ^= (b & 0x7f) << 21;
if (b > 0x7f) {
// only the low 28 bits can be set, so this won't carry
// the sign bit to the long
l = innerLongDecode((long)n);
} else {
l = n;
}
} else {
l = n;
}
} else {
l = n;
}
} else {
l = n;
}
if (pos > limit) {
throw new EOFException();
}
return (l >>> 1) ^ -(l & 1); // back to two's-complement
}
The question is: why do we always check whether the byte we just read is greater than 0x7f?
This is a form of bit-packing where the most significant bit of each byte is used to determine whether another byte should be read. Essentially, this allows you to encode small values in fewer bytes than the fixed-width type would require. The caveat is that a large value needs more bytes than the fixed width (up to 10 bytes for a long, hence ensureBounds(10)). The scheme therefore pays off when most values are small.
Getting to your question, 0x7F is 0111_1111 in binary. You can see that the most significant bit is used as the flag bit.
It's 0b1111111 (127), the largest number that fits in the low 7 bits of a byte, leaving the top bit free as a flag.
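The last line of readLong, commented "back to two's-complement", undoes the zig-zag mapping that the Avro specification prescribes for int and long values, so that small negative numbers also end up as short byte sequences. A minimal sketch of that mapping, with a made-up class name:

```java
// Sketch of the zig-zag mapping; the class name is hypothetical.
final class ZigZagSketch {

    // 0, -1, 1, -2, 2, ... become 0, 1, 2, 3, 4, ...
    static long encode(long n) {
        return (n << 1) ^ (n >> 63);
    }

    // Same expression as the return statement of readLong.
    static long decode(long l) {
        return (l >>> 1) ^ -(l & 1);
    }

    public static void main(String[] args) {
        for (long n : new long[] {0, -1, 1, -2, 18782}) {
            System.out.println(n + " -> " + encode(n) + " -> " + decode(encode(n)));
        }
    }
}
```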

Converting Byte[4] to float - Integer[4] array works but byte[4] does not

This is probably a basic question for our more experienced programmers out there. I'm a bit of a noob and can't work this one out. I'm trying to unpack a binary file and the documentation is not too clear on how floats are stored. I have found a routine that does this, but it only works if I pass an integer array of the bytes. The correct answer is -1865.0. I need to be able to pass the byte array and get the correct answer. How do I need to change the code to make float4byte return -1865.0? Thanks in advance.
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
public class HelloWorld {
public static void main(String[] args) {
byte[] bytes = {(byte) 0xC3,(byte) 0X74,(byte) 0X90,(byte) 0X00 };
int[] ints = {(int) 0xC3,(int) 0X74,(int) 0X90,(int) 0X00 };
// This give the wrong answer
float f = ByteBuffer.wrap(bytes).order(ByteOrder.BIG_ENDIAN).getFloat();
System.out.println("VAL ByteBuffer BI: " + f);
// This give the wrong answer
f = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN).getFloat();
System.out.println("VAL ByteBuffer LI: " + f);
//This gives the RIGHT answer
f = float4int (ints[0], ints[1], ints[2], ints[3]);
System.out.println("VAL Integer : " + f);
// This gives the wrong answer
f = float4byte (bytes[0], bytes[1], bytes[2], bytes[3]);
System.out.println("VAL Bytes : " + f);
}
private static float float4int(int a, int b, int c, int d)
{
int sgn, mant, exp;
System.out.println ("IN Int: "+String.format("%02X ", a)+
String.format("%02X ", b)+String.format("%02X ", c)+String.format("%02X ", d));
mant = b << 16 | c << 8 | d;
if (mant == 0) return 0.0f;
sgn = -(((a & 128) >> 6) - 1);
exp = (a & 127) - 64;
return (float) (sgn * Math.pow(16.0, exp - 6) * mant);
}
private static float float4byte(byte a, byte b, byte c, byte d)
{
int sgn, mant, exp;
System.out.println ("IN Byte : "+String.format("%02X ", a)+
String.format("%02X ", b)+String.format("%02X ", c)+String.format("%02X ", d));
mant = b << 16 | c << 8 | d;
if (mant == 0) return 0.0f;
sgn = -(((a & 128) >> 6) - 1);
exp = (a & 127) - 64;
return (float) (sgn * Math.pow(16.0, exp - 6) * mant);
}
}
The reason why your solution with ByteBuffer doesn't work: the bytes do not match the (Java) internal representation of the float value.
The Java representation is
System.out.println(Integer.toHexString(Float.floatToIntBits(-1865.0f)));
which gives c4e92000
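A small sketch of the mismatch: the bytes that ByteBuffer expects for -1865.0 are the IEEE 754 ones, while the four bytes from the file decode to a different number when read as IEEE 754. The class and method names are made up for the example.

```java
import java.nio.ByteBuffer;

// Sketch only; class and method names are hypothetical.
final class IeeeBytesSketch {

    // ByteBuffer is big-endian by default.
    static float fromBigEndian(byte[] b) {
        return ByteBuffer.wrap(b).getFloat();
    }

    static byte[] toBigEndian(float f) {
        return ByteBuffer.allocate(4).putFloat(f).array();
    }

    public static void main(String[] args) {
        byte[] fileBytes = {(byte) 0xC3, (byte) 0x74, (byte) 0x90, (byte) 0x00};
        System.out.println(fromBigEndian(fileBytes));   // -244.5625, not -1865.0
        for (byte b : toBigEndian(-1865.0f)) {
            System.out.printf("%02X ", b);              // C4 E9 20 00
        }
        System.out.println();
    }
}
```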
As for float4byte: bytes are signed in Java. When calculating the mantissa mant, the bytes are implicitly converted to ints with the sign "extended", i.e. (byte)0x90 (decimal -112) is converted to 0xFFFFFF90 (a 32-bit int). However, what you want is just the original byte's 8 bits (0x00000090).
In order to compensate for the effect of sign extension, it suffices to change one line:
mant = (b & 0xFF) << 16 | (c & 0xFF) << 8 | (d & 0xFF);
Here, in (c & 0xFF), the 1-bits caused by sign extension are stripped after the (implicit) conversion to int.
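Put together, a corrected float4byte could look like the sketch below. It is the routine from the question with only the mantissa line changed and the debug output dropped; the wrapper class name is made up.

```java
// Sketch: float4byte from the question with masked mantissa bytes.
final class HexFloatSketch {

    static float float4byte(byte a, byte b, byte c, byte d) {
        // & 0xFF strips the 1-bits introduced by sign extension
        int mant = (b & 0xFF) << 16 | (c & 0xFF) << 8 | (d & 0xFF);
        if (mant == 0) return 0.0f;
        int sgn = -(((a & 128) >> 6) - 1); // +1 or -1 from the top bit of a
        int exp = (a & 127) - 64;          // base-16 exponent, biased by 64
        return (float) (sgn * Math.pow(16.0, exp - 6) * mant);
    }

    public static void main(String[] args) {
        byte c = (byte) 0x90;
        System.out.println(c);          // -112
        System.out.println(c & 0xFF);   // 144
        System.out.println(float4byte((byte) 0xC3, (byte) 0x74, (byte) 0x90, (byte) 0x00)); // -1865.0
    }
}
```

The first byte needs no mask because a & 128 and a & 127 already discard the extended bits.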
Edit:
The repacking of floats could be done via the IEEE 754 representation which can be obtained by Float.floatToIntBits (which avoids using slow logarithms). Some complexity in the code is caused by the change of base from 2 to 16:
private static byte[] byte4float(float f) {
assert !Float.isNaN(f);
// see also JavaDoc of Float.intBitsToFloat(int)
int bits = Float.floatToIntBits(f);
int s = (bits >> 31) == 0 ? 1 : -1;
int e = (bits >> 23) & 0xFF;
int m = (e == 0) ? (bits & 0x7FFFFF) << 1 : (bits& 0x7FFFFF) | 0x800000;
int exp = (e - 150) / 4 + 6;
int mant;
int mantissaShift = (e - 150) % 4; // compensate for base 16
if (mantissaShift >= 0) mant = m << mantissaShift;
else { mant = m << (mantissaShift + 4); exp--; }
if (mant > 0xFFFFFFF) { mant >>= 4; exp++; } // loss of precision
byte a = (byte) ((1 - s) << 6 | (exp + 64));
return new byte[]{ a, (byte) (mant >> 16), (byte) (mant >> 8), (byte) mant };
}
The code does not take into account any rules that may exist for the packaging, e.g. for representing zero or normalization of the mantissa. But it might serve as a starting point.
Thanks to @halfbit and a bit of testing and minor changes, this routine appears to convert an IEEE 754 float into an IBM float.
public static byte[] byte4float(float f) {
assert !Float.isNaN(f);
// see also JavaDoc of Float.intBitsToFloat(int)
int bits = Float.floatToIntBits(f);
int s = (bits >> 31) == 0 ? 1 : -1;
int e = (bits >> 23) & 0xFF;
int m = (e == 0) ? (bits & 0x7FFFFF) << 1 : (bits& 0x7FFFFF) | 0x800000;
int exp = (e - 150) / 4 + 6;
int mant;
int mantissaShift = (e - 150) % 4; // compensate for base 16
if (mantissaShift >= 0) mant = m >> mantissaShift;
else mant = m >> (Math.abs(mantissaShift));
if (mant > 0xFFFFFFF) { mant >>= 4; exp++; } // loss of precision
byte a = (byte) ((1 - s) << 6 | (exp + 64));
return new byte[]{ a, (byte) (mant >> 16), (byte) (mant >> 8), (byte) mant };
}
I think this is right and appears to be working.

Bit manipulation C source in Java

I am trying to calculate the checksum of a Sega Genesis ROM file in Java. For this I want to port a code snippet from C to Java:
static uint16 getchecksum(uint8 *rom, int length)
{
int i;
uint16 checksum = 0;
for (i = 0; i < length; i += 2)
{
checksum += ((rom[i] << 8) + rom[i + 1]);
}
return checksum;
}
I understand what the code does: it sums all 16-bit numbers (each combined from two 8-bit ones). What I don't understand is what happens when the uint16 overflows, and how that carries over to Java code.
Edit:
This code seems to work, thanks:
int calculatedChecksum = 0;
int bufferi1=0;
int bufferi2=0;
bs = new BufferedInputStream(new FileInputStream(this.file));
bufferi1 = bs.read();
bufferi2 = bs.read();
while(bufferi1 != -1 && bufferi2 != -1){
calculatedChecksum += (bufferi1*256 + bufferi2);
calculatedChecksum = calculatedChecksum % 0x10000;
bufferi1 = bs.read();
bufferi2 = bs.read();
}
Simply put, the overflow is lost.
A more correct approach (imho) is to use uint32 for summation, and then you have the sum in the lower 16 bits, and the overflow in the upper 16 bits.
static int checksum(final InputStream in) throws IOException {
short v = 0;
int c;
while ((c = in.read()) >= 0) {
v += (c << 8) | in.read();
}
return v & 0xffff;
}
This should work equivalently; by using & 0xffff, we get to treat the value in v as if it were unsigned the entire time, since arithmetic overflow is identical w.r.t. bits.
You want addition modulo 2^16, which you can simply spell out manually:
checksum = (checksum + ((rom[i] << 8) + rom[i + 1])) % 0x10000;
// ^^^^^^^^^
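For a ROM that is already in memory, the same idea can be sketched over a byte[]. The class name is made up, and the sketch assumes an even number of bytes; a trailing odd byte is simply ignored here, which the C code does not specify.

```java
// Sketch of the C loop in Java; the class name is hypothetical.
final class GenesisChecksumSketch {

    static int checksum(byte[] rom) {
        int sum = 0;
        // assumption: a trailing odd byte, if any, is skipped
        for (int i = 0; i + 1 < rom.length; i += 2) {
            // & 0xFF turns the signed Java bytes back into 0..255
            int word = ((rom[i] & 0xFF) << 8) | (rom[i + 1] & 0xFF);
            // & 0xFFFF drops the overflow, like uint16 arithmetic in C
            sum = (sum + word) & 0xFFFF;
        }
        return sum;
    }

    public static void main(String[] args) {
        byte[] rom = {(byte) 0xFF, (byte) 0xFF, 0x00, 0x02};
        System.out.println(checksum(rom)); // 1, since 0xFFFF + 0x0002 wraps around
    }
}
```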

Extract bit sequences of arbitrary length from byte[] array efficiently

I'm looking for the most efficient way of extracting (unsigned) bit sequences of arbitrary length (0 <= length <= 16) at arbitrary positions. The skeleton class shows how my current implementation essentially handles the problem:
public abstract class BitArray {
byte[] bytes = new byte[2048];
int bitGet;
public BitArray() {
}
public void readNextBlock(int initialBitGet, int count) {
// substitute for reading from an input stream
for (int i=(initialBitGet>>3); i<=count; ++i) {
bytes[i] = (byte) i;
}
prepareBitGet(initialBitGet, count);
}
public abstract void prepareBitGet(int initialBitGet, int count);
public abstract int getBits(int count);
static class Version0 extends BitArray {
public void prepareBitGet(int initialBitGet, int count) {
bitGet = initialBitGet;
}
public int getBits(int len) {
// intentionally gives meaningless result
bitGet += len;
return 0;
}
}
static class Version1 extends BitArray {
public void prepareBitGet(int initialBitGet, int count) {
bitGet = initialBitGet - 1;
}
public int getBits(int len) {
int byteIndex = bitGet;
bitGet = byteIndex + len;
int shift = 23 - (byteIndex & 7) - len;
int mask = (1 << len) - 1;
byteIndex >>= 3;
return (((bytes[byteIndex] << 16) |
((bytes[++byteIndex] & 0xFF) << 8) |
(bytes[++byteIndex] & 0xFF)) >> shift) & mask;
}
}
static class Version2 extends BitArray {
static final int[] mask = { 0x0, 0x1, 0x3, 0x7, 0xF, 0x1F, 0x3F, 0x7F, 0xFF,
0x1FF, 0x3FF, 0x7FF, 0xFFF, 0x1FFF, 0x3FFF, 0x7FFF, 0xFFFF };
public void prepareBitGet(int initialBitGet, int count) {
bitGet = initialBitGet;
}
public int getBits(int len) {
int offset = bitGet;
bitGet = offset + len;
int byteIndex = offset >> 3; // originally used /8
int bitIndex = offset & 7; // originally used %8
if ((bitIndex + len) > 16) {
return ((bytes[byteIndex] << 16 |
(bytes[byteIndex + 1] & 0xFF) << 8 |
(bytes[byteIndex + 2] & 0xFF)) >> (24 - bitIndex - len)) & mask[len];
} else if ((offset + len) > 8) {
return ((bytes[byteIndex] << 8 |
(bytes[byteIndex + 1] & 0xFF)) >> (16 - bitIndex - len)) & mask[len];
} else {
return (bytes[byteIndex] >> (8 - offset - len)) & mask[len];
}
}
}
static class Version3 extends BitArray {
int[] ints = new int[2048];
public void prepareBitGet(int initialBitGet, int count) {
bitGet = initialBitGet;
int put_i = (initialBitGet >> 3) - 1;
int get_i = put_i;
int buf;
buf = ((bytes[++get_i] & 0xFF) << 16) |
((bytes[++get_i] & 0xFF) << 8) |
(bytes[++get_i] & 0xFF);
do {
buf = (buf << 8) | (bytes[++get_i] & 0xFF);
ints[++put_i] = buf;
} while (get_i < count);
}
public int getBits(int len) {
int bit_idx = bitGet;
bitGet = bit_idx + len;
int shift = 32 - (bit_idx & 7) - len;
int mask = (1 << len) - 1;
int int_idx = bit_idx >> 3;
return (ints[int_idx] >> shift) & mask;
}
}
static class Version4 extends BitArray {
int[] ints = new int[1024];
public void prepareBitGet(int initialBitGet, int count) {
bitGet = initialBitGet;
int g = initialBitGet >> 3;
int p = (initialBitGet >> 4) - 1;
final byte[] b = bytes;
int t = (b[g] << 8) | (b[++g] & 0xFF);
final int[] i = ints;
do {
i[++p] = (t = (t << 16) | ((b[++g] & 0xFF) <<8) | (b[++g] & 0xFF));
} while (g < count);
}
public int getBits(final int len) {
final int i;
bitGet = (i = bitGet) + len;
return (ints[i >> 4] >> (32 - len - (i & 15))) & ((1 << len) - 1);
}
}
public void benchmark(String label) {
int checksum = 0;
readNextBlock(32, 1927);
long time = System.nanoTime();
for (int pass=1<<18; pass>0; --pass) {
prepareBitGet(32, 1927);
for (int i=2047; i>=0; --i) {
checksum += getBits(i & 15);
}
}
time = System.nanoTime() - time;
System.out.println(label+" took "+Math.round(time/1E6D)+" ms, checksum="+checksum);
try { // avoid having the console interfere with our next measurement
Thread.sleep(369);
} catch (InterruptedException e) {}
}
public static void main(String[] argv) {
BitArray test;
// for the sake of getting a little less influence from the OS for stable measurement
Thread.currentThread().setPriority(Thread.MAX_PRIORITY);
while (true) {
test = new Version0();
test.benchmark("no implementaion");
test = new Version1();
test.benchmark("Durandal's (original)");
test = new Version2();
test.benchmark("blitzpasta's (adapted)");
test = new Version3();
test.benchmark("MSN's (posted)");
test = new Version4();
test.benchmark("MSN's (half-buffer modification)");
System.out.println("--- next pass ---");
}
}
}
This works, but I'm looking for a more efficient solution (performance wise). The byte array is guaranteed to be relatively small, between a few bytes up to a max of ~1800 bytes. The array is read exactly once (completely) between each call to the read method. There is no need for any error checking in getBits(), such as exceeding the array etc.
It seems my initial question above isn't clear enough. A "bit sequence" of N bits forms an integer of N bits, and I need to extract those integers with minimal overhead. I have no use for strings, as the values are either used as lookup indices or are directly fed into some computation. So basically, the skeleton shown above is a real class and getBits() signature shows how the rest of the code interacts with it.
Extended the example code into a microbenchmark and included blitzpasta's solution (with the missing byte masking fixed). On my old AMD box it turns out as ~11400ms vs ~38000ms. FYI: It's the divide and modulo operations that kill the performance. If you replace /8 with >>3 and %8 with &7, both solutions are pretty close to each other (jdk1.7.0ea104).
There seemed to be a bit of confusion about how and what to work on. The first, original post of the example code included a read() method to indicate where and when the byte buffer was filled. This got lost when the code was turned into the microbenchmark. I re-introduced it to make this a little clearer.
The idea is to beat all existing versions by adding another subclass of BitArray, which needs to implement getBits() and prepareBitGet(); the latter may be empty. Do not change the benchmarking to give your solution an advantage; the same could be done for all the existing solutions, making this a completely moot optimization! (really!!)
I added a Version0, which does nothing but increment the bitGet state. It always returns 0, to give a rough idea of how big the benchmark overhead is. It's only there for comparison.
Also, an adaptation of MSN's idea was added (Version3). To keep things fair and comparable for all competitors, the byte array filling is now part of the benchmark, as well as a preparatory step (see above). Originally MSN's solution did not do so well; there was lots of overhead in preparing the int[] buffer. I took the liberty of optimizing the step a little, which turned it into a fierce competitor :)
You might also find that I de-convoluted your code a little. Your getBit() could be condensed into a 3-liner, probably shaving off one or two percent. I deliberately did this to keep the code readable and because the other versions aren't as condensed as possible either (again for readability).
Conclusion (code example above updated to include versions based on all applicable contributions). On my old AMD box (Sun JRE 1.6.0_21), they come out as:
V0 no implementaion took 5384 ms
V1 Durandal's (original) took 10283 ms
V2 blitzpasta's (adapted) took 12212 ms
V3 MSN's (posted) took 11030 ms
V4 MSN's (half-buffer modification) took 9700 ms
Notes: In this benchmark an average of 7.5 bits is fetched per call to getBits(), and each bit is only read once. Since V3/V4 have to pay a high initialization cost, they tend to show better runtime behavior with more, shorter fetches (and consequently worse the closer to the maximum of 16 the average fetch size gets). Still, V4 stays slightly ahead of all others in all scenarios.
In an actual application, cache contention must be taken into account, since the extra space needed for V3/V4 may increase cache misses to a point where V0 would be a better choice. If the array is to be traversed more than once, V4 should be favored, since it fetches faster than every other version and the costly initialization is amortized after the first pass.
If you just want the unsigned bit sequence as an int.
static final int[] lookup = {0x0, 0x1, 0x3, 0x7, 0xF, 0x1F, 0x3F, 0x7F, 0xFF, 0x1FF, 0x3FF, 0x7FF, 0xFFF, 0x1FFF, 0x3FFF, 0x7FFF, 0xFFFF };
/*
* bytes: byte array, with the bits indexed from 0 (MSB) to (bytes.length * 8 - 1) (LSB)
* offset: index of the MSB of the bit sequence.
* len: length of bit sequence, must be in the range [0,16].
* Not checked for overflow
*/
static int getBitSeqAsInt(byte[] bytes, int offset, int len){
int byteIndex = offset / 8;
int bitIndex = offset % 8;
int val;
// each byte is masked with 0xFF so that sign extension cannot leak into the result
if ((bitIndex + len) > 16) {
val = (((bytes[byteIndex] & 0xFF) << 16 | (bytes[byteIndex + 1] & 0xFF) << 8 | (bytes[byteIndex + 2] & 0xFF)) >> (24 - bitIndex - len)) & lookup[len];
} else if ((bitIndex + len) > 8) {
val = (((bytes[byteIndex] & 0xFF) << 8 | (bytes[byteIndex + 1] & 0xFF)) >> (16 - bitIndex - len)) & lookup[len];
} else {
val = ((bytes[byteIndex] & 0xFF) >> (8 - bitIndex - len)) & lookup[len];
}
return val;
}
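As a quick sanity check of the windowing idea, here is a branch-free sketch that always builds a 24-bit window from up to three bytes and shifts the wanted field down. It assumes the MSB-first bit numbering described in the comment above; the class name is made up, and the padding with zero past the end of the array is an addition for the sake of a safe demo.

```java
// Sketch only; the class name and the zero padding are assumptions.
final class BitWindowSketch {

    static int getBits(byte[] bytes, int offset, int len) {
        int byteIndex = offset >> 3;
        int bitIndex = offset & 7;
        int window = 0;
        for (int k = 0; k < 3; k++) {
            int idx = byteIndex + k;
            // mask with 0xFF against sign extension; pad with 0 past the end
            int b = idx < bytes.length ? bytes[idx] & 0xFF : 0;
            window = (window << 8) | b;
        }
        return (window >> (24 - bitIndex - len)) & ((1 << len) - 1);
    }

    public static void main(String[] args) {
        byte[] bytes = {1, 2, 3};
        // bits 5..15 of 00000001 00000010 00000011 are 00100000010
        System.out.println(getBits(bytes, 5, 11)); // 258
    }
}
```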
If you want it as a String (modification of Margus' answer).
static String getBitSequence(byte[] bytes, int offset, int len){
int byteIndex = offset / 8;
int bitIndex = offset % 8;
int count = 0;
StringBuilder result = new StringBuilder();
outer:
for(int i = byteIndex; i < bytes.length; ++i) {
for(int j = (1 << (7 - bitIndex)); j > 0; j >>= 1) {
if(count == len) {
break outer;
}
if((bytes[i] & j) == 0) {
result.append('0');
} else {
result.append('1');
}
++count;
}
bitIndex = 0;
}
return result.toString();
}
Well, depending on how far you want to go down the time vs. memory see-saw, you can allocate a side table holding the 32 bits that start at every byte offset, and then do a shift and mask based on the bit offset within that byte:
byte[] bytes = new byte[2048];
int bitGet;
// Java has no unsigned int; a plain int works because getBits() masks the result
int[] dwords = new int[bytes.length - 3];
public BitArray() {
for (int i=0; i<bytes.length; ++i) {
bytes[i] = (byte) i;
}
for (int i= 0; i<dwords.length; ++i) {
// mask each byte so sign extension cannot spill into the higher bits
dwords[i]=
((bytes[i ] & 0xFF) << 24) |
((bytes[i + 1] & 0xFF) << 16) |
((bytes[i + 2] & 0xFF) << 8) |
(bytes[i + 3] & 0xFF);
}
}
int getBits(int len)
{
int offset= bitGet;
bitGet= offset + len;
int offset_index= offset>>3;
int offset_offset= offset & 7;
// bits are numbered MSB-first, so shift the field down from the top of the window
return (dwords[offset_index] >>> (32 - offset_offset - len)) & ((1 << len) - 1);
}
You avoid the branching (at the cost of quadrupling your memory footprint). And is looking up the mask really that much faster than (1 << len) - 1?
Just wondering: why can't you use java.util.BitSet?
Basically what you can do is read the whole data as byte[], convert it to binary in string format and use string utilities like .substring() to do the work. This will also work for bit sequences > 16.
Let's say you have 3 bytes: 1, 2, 3 and you want to extract the bit sequence from the 5th to the 16th bit.
Number Binary
1 00000001
2 00000010
3 00000011
Code example:
public static String getRealBinary(byte[] input){
StringBuilder sb = new StringBuilder();
for (byte c : input) {
for (int n = 128; n > 0; n >>= 1){
if ((c & n) == 0)
sb.append('0');
else sb.append('1');
}
}
return sb.toString();
}
public static void main(String[] args) {
byte bytes[] = new byte[]{1,2,3};
String sbytes = getRealBinary(bytes);
System.out.println(sbytes);
System.out.println(sbytes.substring(5,16));
}
Output:
000000010000001000000011
00100000010
Speed:
I ran the test 1 million times and on my computer it took 0.995 s, so it is reasonably fast:
Code to repeat the test yourself:
public static void main(String[] args) {
Random r = new Random();
byte bytes[] = new byte[4];
long start, time, total=0;
for (int i = 0; i < 1000000; i++) {
r.nextBytes(bytes);
start = System.currentTimeMillis();
getRealBinary(bytes).substring(5,16);
time = System.currentTimeMillis() - start;
total+=time;
}
System.out.println("It took " +total + "ms");
}
You want at most 16 bits, taken from an array of bytes. 16 bits can span at most 3 bytes.
Here's a possible solution:
int GetBits(int bit_index, int bit_length) {
int byte_offset = bit_index >> 3;
// mask each byte with 0xFF because Java bytes are signed
return ((((((byte_array[byte_offset] & 0xFF) << 8)
+ (byte_array[byte_offset+1] & 0xFF)) << 8)
+ (byte_array[byte_offset+2] & 0xFF))
>> (24 - (bit_index&7) - bit_length))
& ((1<<bit_length)-1);
}
[Untested]
If you call this a lot you should precompute the 24-bit values for the 3 concatenated bytes, and store those into an int array.
I'll observe that if you are coding this in C on an x86, you don't even need to precompute the 24-bit array; simply access the byte array at the desired offset as a 32-bit value. The x86 will do unaligned fetches just fine. [A commenter noted that endianness mucks this up, so it isn't an answer. OK, do the 24-bit version.]
Since Java 7 BitSet has the toLongArray method, which I believe will do exactly what the question asks for:
int subBits = (int) bitSet.get(lowBit, highBit).toLongArray()[0];
This has the advantage that it works with sequences larger than ints or longs. It has the performance disadvantage that a new BitSet object must be allocated, and a new array object to hold the result.
It would be really interesting to see how this compares with the other methods in the benchmark.
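A minimal sketch of the BitSet route, with two caveats that matter in practice. toLongArray() returns an empty array when none of the selected bits is set, so the [0] access needs a guard. Also, BitSet.valueOf(byte[]) numbers bits from the least significant bit of each byte, which is the opposite of the MSB-first numbering used elsewhere in this thread. The class and method names are made up.

```java
import java.util.BitSet;

// Sketch only; class and method names are hypothetical.
final class BitSetSketch {

    // Bits lowBit (inclusive) to highBit (exclusive), LSB-first numbering.
    static long extract(BitSet bits, int lowBit, int highBit) {
        long[] words = bits.get(lowBit, highBit).toLongArray();
        return words.length == 0 ? 0L : words[0]; // empty array means all zero
    }

    public static void main(String[] args) {
        BitSet bits = BitSet.valueOf(new byte[] {0b0110_1000}); // bits 3, 5, 6 set
        System.out.println(extract(bits, 3, 7)); // 13 (binary 1101)
        System.out.println(extract(bits, 0, 3)); // 0
    }
}
```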
