I'm reading avro format specification and trying to understand its implementation. Here is the method for decoding long value:
@Override
public long readLong() throws IOException {
    ensureBounds(10);
    int b = buf[pos++] & 0xff;
    int n = b & 0x7f;
    long l;
    if (b > 0x7f) {
        b = buf[pos++] & 0xff;
        n ^= (b & 0x7f) << 7;
        if (b > 0x7f) {
            b = buf[pos++] & 0xff;
            n ^= (b & 0x7f) << 14;
            if (b > 0x7f) {
                b = buf[pos++] & 0xff;
                n ^= (b & 0x7f) << 21;
                if (b > 0x7f) {
                    // only the low 28 bits can be set, so this won't carry
                    // the sign bit to the long
                    l = innerLongDecode((long) n);
                } else {
                    l = n;
                }
            } else {
                l = n;
            }
        } else {
            l = n;
        }
    } else {
        l = n;
    }
    if (pos > limit) {
        throw new EOFException();
    }
    return (l >>> 1) ^ -(l & 1); // back to two's-complement
}
The question is: why do we always check whether the byte we just read is greater than 0x7f?
This is a form of bit-packing where the most significant bit of each byte signals whether another byte should be read. It lets values be encoded in fewer bytes than a fixed-width encoding would use, with the caveat that large values require more than the usual number of bytes. The scheme therefore pays off when most values are small.
Getting to your question, 0x7F is 0111_1111 in binary. You can see that the most significant bit is used as the flag bit.
It's 0b0111_1111 (127), the largest value representable in a byte when one bit is reserved as a flag.
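To make the flag-bit scheme concrete, here is a small self-contained sketch (the helper names are mine, not Avro's) that zigzag-encodes a long into flag-bit bytes and decodes it back using the same `b > 0x7f` continuation test as the method above:

```java
import java.io.ByteArrayOutputStream;

public class VarintDemo {
    // ZigZag-encode, then emit 7 bits per byte; the top bit of each byte
    // means "more bytes follow". This mirrors Avro's long encoding.
    static byte[] encodeLong(long v) {
        long n = (v << 1) ^ (v >> 63);            // zigzag: small magnitudes -> small codes
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((n & ~0x7fL) != 0) {
            out.write((int) ((n & 0x7f) | 0x80)); // flag bit set: more to come
            n >>>= 7;
        }
        out.write((int) n);                       // flag bit clear: last byte
        return out.toByteArray();
    }

    static long decodeLong(byte[] buf) {
        long n = 0;
        int shift = 0, pos = 0, b;
        do {
            b = buf[pos++] & 0xff;
            n |= (long) (b & 0x7f) << shift;
            shift += 7;
        } while (b > 0x7f);                       // same continuation test
        return (n >>> 1) ^ -(n & 1);              // undo zigzag
    }

    public static void main(String[] args) {
        System.out.println(encodeLong(64).length);          // 2: 64 zigzags to 128
        System.out.println(decodeLong(encodeLong(-1865L))); // -1865
    }
}
```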
I have an exception when I deserialize a message with a field defined as logicalType date.
As documentation, the field is defined as:
{"name": "startDate", "type": {"type": "int", "logicalType": "date"}}
I use the avro-maven-plugin (1.9.2) to generate the Java classes, and I can set the field startDate to java.time.LocalDate.now(); the Avro object serializes the message and I send it to a Kafka topic. So far, everything is good.
However, when I read the message I get the exception:
Caused by: org.apache.avro.InvalidNumberEncodingException: Invalid int encoding
at org.apache.avro.io.BinaryDecoder.readInt(BinaryDecoder.java:166)
at org.apache.avro.io.ValidatingDecoder.readInt(ValidatingDecoder.java:83)
at org.apache.avro.generic.GenericDatumReader.readInt(GenericDatumReader.java:551)
at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:195)
at org.apache.avro.generic.GenericDatumReader.readWithConversion(GenericDatumReader.java:173)
at org.apache.avro.specific.SpecificDatumReader.readField(SpecificDatumReader.java:134)
at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:247)
at org.apache.avro.specific.SpecificDatumReader.readRecord(SpecificDatumReader.java:123)
at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:179)
at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:160)
at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:153)
What makes everything even weirder is that no error occurs if I set a different date, such as LocalDate.of(1970, 1, 1).
In other words, if the serialized int value representing the number of days since 1970-01-01 is small enough, everything works fine.
I ran that test after looking at the code that raises the exception; it made me think the error could be avoided if the day count were lower than 127:
public int readInt() throws IOException {
    this.ensureBounds(5);
    int len = 1;
    int b = this.buf[this.pos] & 255;
    int n = b & 127;
    if (b > 127) {
        b = this.buf[this.pos + len++] & 255;
        n ^= (b & 127) << 7;
        if (b > 127) {
            b = this.buf[this.pos + len++] & 255;
            n ^= (b & 127) << 14;
            if (b > 127) {
                b = this.buf[this.pos + len++] & 255;
                n ^= (b & 127) << 21;
                if (b > 127) {
                    b = this.buf[this.pos + len++] & 255;
                    n ^= (b & 127) << 28;
                    if (b > 127) {
                        throw new InvalidNumberEncodingException("Invalid int encoding");
                    }
                }
            }
        }
    }
    ....
Of course, in production I can't use only dates close to 1970-01-01.
Any help is welcome :-)
TL;DR
The code that you have posted can deserialize numbers not only up to 127, but the full range of Java int, so up to a couple of billion, corresponding to dates several million years after 1970.
Details
The BinaryDecoder.readInt method from Apache Avro deserializes 1 through 5 bytes into a Java int. It uses the low 7 bits of each byte for the int; the sign bit of each byte is not part of the value. Instead, the sign bit determines how many bytes to read: a sign bit of 0 means this is the last byte, a sign bit of 1 means more bytes follow. The exception is thrown when 5 bytes have been read and all of them had their sign bit set to 1. 5 bytes can supply 35 bits, and an int can hold only 32 bits, so regarding more than 5 bytes as an error is fair.
So from the code that you have posted, no dates that I would reasonably expect to use in an application will pose any problems.
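The encoding side shows the same structure in reverse. Here is a sketch of an encoder (my own helper, not Avro's, operating on the already zigzag-encoded value) that makes the byte counts tangible:

```java
import java.io.ByteArrayOutputStream;

public class IntVarintEncoder {
    // Emit a non-negative (already zigzag-encoded) value 7 bits at a time;
    // the sign bit of each byte is the continuation flag.
    static byte[] encode(int n) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((n & ~0x7f) != 0) {
            out.write((n & 0x7f) | 0x80);
            n >>>= 7;
        }
        out.write(n);
        return out.toByteArray();
    }

    public static void main(String[] args) {
        System.out.println(encode(127).length);               // 1 byte
        System.out.println(encode(18782).length);             // 3 bytes for day 18782
        System.out.println(encode(Integer.MAX_VALUE).length); // worst case: 5 bytes
    }
}
```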
Test code
I put your method in a TestBinaryDecoder class to try it out (full code at the end). Let’s first see how the exception comes from 5 bytes all having their sign bit set to 1:
try {
    System.out.println(new TestBinaryDecoder(-1, -1, -1, -1, -1).readInt());
} catch (IOException ioe) {
    System.out.println(ioe);
}
Output:
ovv.so.binary.misc.InvalidNumberEncodingException: Invalid int encoding
Also as you said, 127 poses no problem:
System.out.println(new TestBinaryDecoder(127, -1, -1, -1, -1).readInt());
127
The interesting part comes when we put more bytes in holding bits of the int that we want. Here the first byte has a sign bit of 1, the next has 0, so I expect those two bytes to be used:
System.out.println(new TestBinaryDecoder(255, 127, -1, -1, -1).readInt());
16383
We are already getting close to the number needed for today’s date. Today is 2021-06-04 in my time zone, day 18782 after the epoch, or in binary: 100100101011110. So let’s try putting those 15 binary digits into three bytes for the decoder:
int epochDay = new TestBinaryDecoder(0b11011110, 0b10010010, 0b1, -1, -1).readInt();
System.out.println(epochDay);
System.out.println(LocalDate.ofEpochDay(epochDay));
18782
2021-06-04
So how you got your exception I can’t tell. The source surely isn’t just a large int value. The problem must be somewhere else.
Full code
public class TestBinaryDecoder {
    private byte[] buf;
    private int pos;

    /** Convenience constructor */
    public TestBinaryDecoder(int... buf) {
        this(toByteArray(buf));
    }

    private static byte[] toByteArray(int[] intArray) {
        byte[] byteArray = new byte[intArray.length];
        IntStream.range(0, intArray.length).forEach(ix -> byteArray[ix] = (byte) intArray[ix]);
        return byteArray;
    }

    public TestBinaryDecoder(byte[] buf) {
        this.buf = buf;
        pos = 0;
    }

    public int readInt() throws IOException {
        this.ensureBounds(5);
        int len = 1;
        int b = this.buf[this.pos] & 255;
        int n = b & 127;
        if (b > 127) {
            b = this.buf[this.pos + len++] & 255;
            n ^= (b & 127) << 7;
            if (b > 127) {
                b = this.buf[this.pos + len++] & 255;
                n ^= (b & 127) << 14;
                if (b > 127) {
                    b = this.buf[this.pos + len++] & 255;
                    n ^= (b & 127) << 21;
                    if (b > 127) {
                        b = this.buf[this.pos + len++] & 255;
                        n ^= (b & 127) << 28;
                        if (b > 127) {
                            throw new InvalidNumberEncodingException("Invalid int encoding");
                        }
                    }
                }
            }
        }
        return n;
    }

    private void ensureBounds(int bounds) {
        System.out.println("Ensuring bounds " + bounds);
    }
}

class InvalidNumberEncodingException extends IOException {
    public InvalidNumberEncodingException(String message) {
        super(message);
    }
}
There is a function written in C that calculates CRC16 CCITT. It is used when reading data from an RFID reader, and it basically works fine. I would like to write a function in Java that does the same thing.
I tried an online converter page to do this, but the code I got is garbage.
Could you please take a look at this and advise why the Java code that should do the same generates a different CRC?
Please find attached the original C function:
void CRC16(unsigned char *Data, unsigned short *CRC, unsigned char Bytes)
{
    int i, byte;
    unsigned short C;
    *CRC = 0;
    for (byte = 1; byte <= Bytes; byte++, Data++)
    {
        C = ((*CRC >> 8) ^ *Data) << 8;
        for (i = 0; i < 8; i++)
        {
            if (C & 0x8000)
                C = (C << 1) ^ 0x1021;
            else
                C = C << 1;
        }
        *CRC = C ^ (*CRC << 8);
    }
}
And here is a different CRC function written in Java that should calculate the same checksum, but does not:
public static int CRC16_CCITT_Test(byte[] buffer) {
    int wCRCin = 0x0000;
    int wCPoly = 0x1021;
    for (byte b : buffer) {
        for (int i = 0; i < 8; i++) {
            boolean bit = ((b >> (7 - i) & 1) == 1);
            boolean c15 = ((wCRCin >> 15 & 1) == 1);
            wCRCin <<= 1;
            if (c15 ^ bit)
                wCRCin ^= wCPoly;
        }
    }
    wCRCin &= 0xffff;
    return wCRCin;
}
When I calculate the CRC of the bytes 0, 2, 3 in both functions, I get different results:
for the C function it is (DEC): 22017
for the Java function it is (DEC): 28888
OK. I have converted the C code into Java and got it partially working.
public static int CRC16_Test(byte[] data, byte bytes) {
    int dataIndex = 0;
    short c;
    short[] crc = {0};
    crc[0] = (short) 0;
    for (int j = 1; j <= Byte.toUnsignedInt(bytes); j++, dataIndex++) {
        c = (short) ((Short.toUnsignedInt(crc[0]) >> 8 ^ Byte.toUnsignedInt(data[dataIndex])) << 8);
        for (int i = 0; i < 8; i++) {
            if ((Short.toUnsignedInt(c) & 0x8000) != 0) {
                c = (short) (Short.toUnsignedInt(c) << 1 ^ 0x1021);
            } else {
                c = (short) (Short.toUnsignedInt(c) << 1);
            }
        }
        crc[0] = (short) (Short.toUnsignedInt(c) ^ Short.toUnsignedInt(crc[0]) << 8);
    }
    return crc[0];
}
It gives the same CRC values as the C code for 0, 2, 3, but e.g. for the numbers 255, 216, 228 the C code CRC is 60999 while the Java CRC is -4537.
OK. Finally thanks to your pointers I got this working.
The last change required was changing 'return crc[0]' to:
return (int) crc[0] & 0xffff;
... and it works...
Many thanks to all :)
There is nothing wrong. In 16 bits, -4537 is represented by exactly the same bit pattern as 60999. If you would like your routine to return the positive version, convert to int (which is 32 bits) and apply & 0xffff.
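To illustrate both points, here is a compact Java translation of the C routine (a sketch of this thread's algorithm, not library code) that masks every intermediate to 16 bits, mimicking C's unsigned short truncation, together with a demonstration that -4537 and 60999 share the same 16-bit pattern:

```java
public class Crc16Ccitt {
    // Faithful translation of the C CRC16 routine; the & 0xFFFF masks play
    // the role of C's unsigned short truncation.
    static int crc16(byte[] data) {
        int crc = 0;
        for (byte d : data) {
            int c = (((crc >> 8) ^ (d & 0xFF)) << 8) & 0xFFFF;
            for (int i = 0; i < 8; i++) {
                c = ((c & 0x8000) != 0) ? ((c << 1) ^ 0x1021) & 0xFFFF
                                        : (c << 1) & 0xFFFF;
            }
            crc = (c ^ (crc << 8)) & 0xFFFF;
        }
        return crc;
    }

    public static void main(String[] args) {
        System.out.println(crc16(new byte[]{0, 2, 3}));                  // 22017
        int crc = crc16(new byte[]{(byte) 255, (byte) 216, (byte) 228});
        System.out.println(crc);          // the masked int value
        System.out.println((short) crc);  // the same 16 bits, read as a signed short
    }
}
```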
I am developing software in JavaCard to add points in ECC.
The issue is that I need some basic operations; for the moment I need multiplication and inversion, as I already have addition and subtraction.
I was trying to implement Montgomery multiplication, but that is for GF(2^m) (I think).
so my example is:
public static void multiplicationGF_p2() {
    byte A = (byte) 7;
    byte p = (byte) 5;
    byte B = (byte) 2;
    byte C = (byte) 0;
    byte n = (byte) 8;
    byte i = (byte) (n - 1);
    for (; i >= 0; i--) {
        C = (byte) (((C & 0xFF) + (C & 0xFF)) + ((A & 0xff) << getBytePos(B, i)));
        if ((C & 0xFF) >= (byte) (p & 0xFF)) {
            C = (byte) ((C & 0xFF) - (p & 0xFF));
        }
        if ((C & 0xFF) >= (byte) (p & 0xFF)) {
            C = (byte) ((C & 0xFF) - (p & 0xFF));
        }
    }
}
For example, with A = 2, B = 3, p = 3, C must be 0, since C = A·B (mod p).
But in this example, with A = 7, B = 2, p = 5, C must be 4, yet I get 49.
Can someone help me with that?
more methods:
public static byte getBytePos(byte b, byte pos) {
    return (byte) (((b & 0xff) >> pos) & 1);
}
I am keeping it simple for the moment, but the idea is to multiply very big numbers, e.g. arrays of 10 bytes.
I suspected that something was wrong here:
C = (byte)(((C & 0xFF) + (C & 0xFF) ) + ((A & 0xff) << getBytePos(B,i)));
I created a method that multiplies byte numbers instead of relying only on the left shift <<.
So:
public static byte bmult(byte x, byte y) {
    byte total = (byte) 0;
    byte i;
    byte n = (byte) 8; // multiplication for 8 bits or 1 byte
    for (i = n; i >= 0; i--) {
        total <<= 1;
        if ((((y & 0xff) & (1 << i)) >> i) != (byte) 0) {
            total = (byte) (total + x);
        }
    }
    return total;
}
So then I added it to my original method (on the marked line):
C = (byte)(((C & 0xFF) + (C & 0xFF) ) + bmult(A, getBytePos(B,i)) );
For now it is working correctly; I need to test it more.
Does someone have another solution?
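One alternative is the classic left-to-right double-and-add reduction. Here is a plain-int sketch of that technique (a JavaCard version would operate on byte arrays, with a getBytePos-style helper extracting the bits):

```java
public class ModMul {
    // c = a*b mod p, scanning b's 8 bits from the most significant down:
    // double the accumulator, then add a whenever the current bit is set.
    static int mulMod(int a, int b, int p) {
        a %= p;
        int c = 0;
        for (int i = 7; i >= 0; i--) {
            c = (c + c) % p;            // double
            if (((b >> i) & 1) == 1) {
                c = (c + a) % p;        // add
            }
        }
        return c;
    }

    public static void main(String[] args) {
        System.out.println(mulMod(7, 2, 5));       // 4
        System.out.println(mulMod(2, 3, 3));       // 0
        System.out.println(mulMod(200, 200, 251)); // 91
    }
}
```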
This is probably a basic question for the more experienced programmers out there. I'm a bit of a noob and can't work this one out. I'm trying to unpack a binary file, and the documentation is not too clear on how floats are stored. I have found a routine that does this, but it only works if I pass an integer array of the bytes. The correct answer is -1865.0. I need to be able to pass the byte array and get the correct answer. How do I need to change the code to make float4byte return -1865.0? Thanks in advance.
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class HelloWorld {
    public static void main(String[] args) {
        byte[] bytes = {(byte) 0xC3, (byte) 0x74, (byte) 0x90, (byte) 0x00};
        int[] ints = {0xC3, 0x74, 0x90, 0x00};
        // This gives the wrong answer
        float f = ByteBuffer.wrap(bytes).order(ByteOrder.BIG_ENDIAN).getFloat();
        System.out.println("VAL ByteBuffer BI: " + f);
        // This gives the wrong answer
        f = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN).getFloat();
        System.out.println("VAL ByteBuffer LI: " + f);
        // This gives the RIGHT answer
        f = float4int(ints[0], ints[1], ints[2], ints[3]);
        System.out.println("VAL Integer : " + f);
        // This gives the wrong answer
        f = float4byte(bytes[0], bytes[1], bytes[2], bytes[3]);
        System.out.println("VAL Bytes : " + f);
    }

    private static float float4int(int a, int b, int c, int d) {
        int sgn, mant, exp;
        System.out.println("IN Int: " + String.format("%02X ", a)
                + String.format("%02X ", b) + String.format("%02X ", c) + String.format("%02X ", d));
        mant = b << 16 | c << 8 | d;
        if (mant == 0) return 0.0f;
        sgn = -(((a & 128) >> 6) - 1);
        exp = (a & 127) - 64;
        return (float) (sgn * Math.pow(16.0, exp - 6) * mant);
    }

    private static float float4byte(byte a, byte b, byte c, byte d) {
        int sgn, mant, exp;
        System.out.println("IN Byte : " + String.format("%02X ", a)
                + String.format("%02X ", b) + String.format("%02X ", c) + String.format("%02X ", d));
        mant = b << 16 | c << 8 | d;
        if (mant == 0) return 0.0f;
        sgn = -(((a & 128) >> 6) - 1);
        exp = (a & 127) - 64;
        return (float) (sgn * Math.pow(16.0, exp - 6) * mant);
    }
}
The reason why your solution with ByteBuffer doesn't work is that the bytes do not match the (Java) internal representation of the float value.
The Java representation is
System.out.println(Integer.toHexString(Float.floatToIntBits(-1865.0f)));
which gives c4e92000
Bytes are signed in Java. When calculating the mantissa mant, the bytes are implicitly converted to ints, with the sign "extended": (byte) 0x90 (decimal -112) gets converted to 0xFFFFFF90 (32-bit int). However, what you want is just the original byte's 8 bits (0x00000090).
In order to compensate for the effect of sign extension, it suffices to change one line:
mant = (b & 0xFF) << 16 | (c & 0xFF) << 8 | (d & 0xFF);
Here, as in (c & 0xFF), the 1-bits introduced by sign extension are stripped after the (implicit) conversion to int.
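A quick demonstration of the sign-extension effect and the mask that undoes it:

```java
public class SignExtension {
    public static void main(String[] args) {
        byte b = (byte) 0x90;                              // bit pattern 1001_0000, value -112
        int widened = b;                                   // implicit widening sign-extends
        System.out.println(Integer.toHexString(widened));  // ffffff90
        System.out.println(Integer.toHexString(b & 0xFF)); // 90: the mask strips the extension
    }
}
```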
Edit:
The repacking of floats could be done via the IEEE 754 representation which can be obtained by Float.floatToIntBits (which avoids using slow logarithms). Some complexity in the code is caused by the change of base from 2 to 16:
private static byte[] byte4float(float f) {
    assert !Float.isNaN(f);
    // see also JavaDoc of Float.intBitsToFloat(int)
    int bits = Float.floatToIntBits(f);
    int s = (bits >> 31) == 0 ? 1 : -1;
    int e = (bits >> 23) & 0xFF;
    int m = (e == 0) ? (bits & 0x7FFFFF) << 1 : (bits & 0x7FFFFF) | 0x800000;
    int exp = (e - 150) / 4 + 6;
    int mant;
    int mantissaShift = (e - 150) % 4; // compensate for base 16
    if (mantissaShift >= 0) mant = m << mantissaShift;
    else { mant = m << (mantissaShift + 4); exp--; }
    if (mant > 0xFFFFFFF) { mant >>= 4; exp++; } // loss of precision
    byte a = (byte) ((1 - s) << 6 | (exp + 64));
    return new byte[]{ a, (byte) (mant >> 16), (byte) (mant >> 8), (byte) mant };
}
The code does not take into account any rules that may exist for the packaging, e.g. for representing zero or normalization of the mantissa. But it might serve as a starting point.
Thanks to @halfbit, a bit of testing, and minor changes, this routine appears to convert an IEEE 754 float into an IBM float.
public static byte[] byte4float(float f) {
    assert !Float.isNaN(f);
    // see also JavaDoc of Float.intBitsToFloat(int)
    int bits = Float.floatToIntBits(f);
    int s = (bits >> 31) == 0 ? 1 : -1;
    int e = (bits >> 23) & 0xFF;
    int m = (e == 0) ? (bits & 0x7FFFFF) << 1 : (bits & 0x7FFFFF) | 0x800000;
    int exp = (e - 150) / 4 + 6;
    int mant;
    int mantissaShift = (e - 150) % 4; // compensate for base 16
    if (mantissaShift >= 0) mant = m >> mantissaShift;
    else mant = m >> Math.abs(mantissaShift);
    if (mant > 0xFFFFFFF) { mant >>= 4; exp++; } // loss of precision
    byte a = (byte) ((1 - s) << 6 | (exp + 64));
    return new byte[]{ a, (byte) (mant >> 16), (byte) (mant >> 8), (byte) mant };
}
I think this is right and appears to be working.
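Putting the corrected packer together with the sign-extension fix in the decoder gives a self-contained round trip. This is a sketch assembled from the code in this thread; I have only checked it against sample values like the one below:

```java
public class IbmFloatRoundTrip {
    // IEEE 754 float -> 4-byte IBM hex float (the corrected version from above)
    static byte[] byte4float(float f) {
        assert !Float.isNaN(f);
        int bits = Float.floatToIntBits(f);
        int s = (bits >> 31) == 0 ? 1 : -1;
        int e = (bits >> 23) & 0xFF;
        int m = (e == 0) ? (bits & 0x7FFFFF) << 1 : (bits & 0x7FFFFF) | 0x800000;
        int exp = (e - 150) / 4 + 6;
        int mantissaShift = (e - 150) % 4; // compensate for base 16
        int mant = (mantissaShift >= 0) ? m >> mantissaShift : m >> -mantissaShift;
        if (mant > 0xFFFFFFF) { mant >>= 4; exp++; } // loss of precision
        byte a = (byte) ((1 - s) << 6 | (exp + 64));
        return new byte[]{a, (byte) (mant >> 16), (byte) (mant >> 8), (byte) mant};
    }

    // IBM hex float -> IEEE float, with the (b & 0xFF) sign-extension fix
    static float float4byte(byte a, byte b, byte c, byte d) {
        int mant = (b & 0xFF) << 16 | (c & 0xFF) << 8 | (d & 0xFF);
        if (mant == 0) return 0.0f;
        int sgn = -(((a & 128) >> 6) - 1);
        int exp = (a & 127) - 64;
        return (float) (sgn * Math.pow(16.0, exp - 6) * mant);
    }

    public static void main(String[] args) {
        byte[] packed = byte4float(-1865.0f);
        System.out.printf("%02X %02X %02X %02X%n",
                packed[0], packed[1], packed[2], packed[3]);          // C3 74 90 00
        System.out.println(float4byte(packed[0], packed[1], packed[2], packed[3])); // -1865.0
    }
}
```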
I am trying to understand the code below.
The method getKey() returns a String, and getDistance() returns a double. The code is taken from a class meant to hold pairs of a String (the key) and a Double (the distance).
To be more specific, I am unsure what the lines that do the shifting are doing.
public void serialize(byte[] outputArray) {
    // write the length of the string out
    byte[] data = getKey().getBytes();
    for (int i = 0; i < 2; i++) {
        outputArray[i] = (byte) ((data.length >>> ((1 - i) * 8)) & 0xFF);
    }
    // write the key out
    for (int i = 0; i < data.length; i++) {
        outputArray[i + 2] = data[i];
    }
    // now write the distance out
    long bits = Double.doubleToLongBits(getDistance());
    for (int i = 0; i < 8; i++) {
        outputArray[i + 2 + data.length] = (byte) ((bits >>> ((7 - i) * 8)) & 0xFF);
    }
}
Any help would be very appreciated.
>>> is the unsigned right shift operator: it shifts zeroes into the high-order bits instead of copying the sign bit.
& 0xFF keeps only the low 8 bits, so the result fits in a byte; without it you may be left with garbage from the higher bits.
Start by reading Java's tutorial on bitwise operators. In short:
>>> is an unsigned right shift
& 0xFF is ANDing the outcome of (bits >>> ((7 - i) * 8)) with 0xFF
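A short demonstration of the difference, using a negative int:

```java
public class ShiftDemo {
    public static void main(String[] args) {
        int x = -256;                                      // bit pattern 0xFFFFFF00
        System.out.println(Integer.toHexString(x >> 8));   // ffffffff: >> copies the sign bit
        System.out.println(Integer.toHexString(x >>> 8));  // ffffff: >>> shifts zeroes in
        System.out.println((x >>> 8) & 0xFF);              // 255: masked down to one byte
    }
}
```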