Creating a bitmask with a large number of options - java

In my Android app I have a class containing only data (exposed with getters). This class needs to be serialized and sent across to other clients (done naively by iterating over all getters, and storing them in a ByteBuffer).
public class Data
{
public int getOption1() { }
public int getOption2() { }
// ...
public int getOptionN() { }
}
Serialize:
public void serialize(Data data) {
// write getOption1();
// write getOption2();
// ...
}
Deserialize:
public void deserialize() {
// read Option1();
// read Option2();
// ...
}
I'd like to be able to define which fields actually get sent (instead of blindly sending all of them), and one potential solution for this would be to define another field which is a bitmask that defines which fields are actually sent.
The receiving side parses the bitmask, and can tell which of the fields should be deserialized from the received message.
The problem is - using an int (32-bit) for bitmask allows for only 32 unique options (by using the "standard" power of 2 enum values).
How can one define a bitmask that can support a larger number of items? Is there any other encoding (other than storing each value as a power of 2)?
The number of actual values may vary (depending on user input) and may be anything from ~ 50 up to 200.
I'd like to encode the different options in the most efficient encoding.

An int provides a bit for each of 32 options. You can use a long to get a bit for each of 64 options. For larger number of options, you can use an int or long array. Take the number of options, divide by 32 (for an int array) or 64 (for a long array) and round up.
A byte array will provide the least waste. Divide the number of options by 8 and round up. You can reserve the first byte to contain the length of the byte array (if you're passing other data as well). Byte.MAX_VALUE is 127, but if you treat the stored value as the maximum valid index rather than the byte count, you can describe up to 128 bytes. That gives 128 * 8 = 1024 options (bit indices 0 through 1023), or 2048 options if you are willing to do a little extra work to deal with negative byte values. The maximum waste will be less than one byte (plus an additional byte of overhead to store the count).
If each option can be independently there or not there, you cannot do much better. If options can be grouped such that all options in a group are always either all present or all absent, then some additional compression may be possible.
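The byte-array mask described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the class name PresenceMask, the use of int values for every option, and the simple "length byte, mask, values" layout are hypothetical choices, not something given in the question.

```java
import java.nio.ByteBuffer;

final class PresenceMask {
    /** Number of bytes needed to hold one bit per option (rounded up). */
    static int maskBytes(int optionCount) {
        return (optionCount + 7) / 8;
    }

    /** Writes a one-byte length, the mask, then only the values whose flag is set. */
    static void serialize(int[] values, boolean[] present, ByteBuffer out) {
        byte[] mask = new byte[maskBytes(values.length)];
        for (int i = 0; i < values.length; i++) {
            if (present[i]) {
                mask[i / 8] |= (byte) (1 << (i % 8));
            }
        }
        out.put((byte) mask.length); // assumption: at most 127 mask bytes
        out.put(mask);
        for (int i = 0; i < values.length; i++) {
            if (present[i]) {
                out.putInt(values[i]);
            }
        }
    }

    /** Reads the mask and fills in only the options that were sent; the rest stay null. */
    static Integer[] deserialize(ByteBuffer in, int optionCount) {
        byte[] mask = new byte[in.get()];
        in.get(mask);
        Integer[] result = new Integer[optionCount];
        for (int i = 0; i < optionCount; i++) {
            if ((mask[i / 8] & (1 << (i % 8))) != 0) {
                result[i] = in.getInt();
            }
        }
        return result;
    }
}
```

With 50 options the mask costs 7 bytes plus the length byte, regardless of how many values are actually sent.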

Related

Efficiently Storing a Short History of Boolean Events for many Components

To preface this - I have no influence on the design of this problem and I can't really give a lot of details about the technical background.
Say I have a lot of components of the same type that regularly get a boolean event - and I need to hold a short history of these boolean events.
A coworker of mine wrote a rather naive implementation using the type Map<Component, CircularFifoQueue<Boolean>>, CircularFifoQueue being a data structure from Apache Commons. The code works, but given how generics work in Java and the dimensions used, this is really inefficient, as each entry stores a reference to one of the two singleton Boolean objects instead of just one bit.
Generally there are around 100K components and the history is supposed to hold the 5-10 most recent boolean values (might be subject to change but probably won't be larger than 10). This currently means that around 1.5GB of RAM are allocated just for these history maps. Also these changes happen quite frequently so it wouldn't hurt to increase the CPU efficiency if possible.
One obvious change would be to move the history into the Component class to remove the HashMap-induced overhead.
The more complicated question is how to efficiently store the last few boolean values.
One possible way would be to use BitSets, but as those use long[] as their underlying data structure, I doubt it would be the most efficient way to store what is essentially 5 bits.
Another option would be to directly use an integer and shift the value as a way to remove old entries. So basically
int history = 0;
public void set(int length, boolean active){
if(active) {
history |= 1 << length;
} else {
history &= ~(1 << length);
}
// shift one to the right to remove oldest entry
history = history >> 1;
}
Just off the top of my head. This code is untested. I don't know how efficient or if it works, but that is about what I had in mind.
But that would still lead to quite some overhead compared to the optimal case of storing 5 bits of data using 5 bits of memory.
One could achieve some additional saving if the histories of the different components were stored in a contiguous array, but I'm not sure how to handle either one giant contiguous BitSet or, alternatively, a large byte[] where each byte represents one boolean history as explained above.
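The byte[] idea mentioned above might look something like the sketch below. It assumes that every component can be given a dense integer index and that the history length is at most 8; the class name EventHistories and the constant HISTORY_LENGTH are made up for illustration.

```java
/** Keeps the last HISTORY_LENGTH boolean events per component, one byte each. */
final class EventHistories {
    private static final int HISTORY_LENGTH = 5; // assumption; must be <= 8
    private static final int MASK = (1 << HISTORY_LENGTH) - 1;

    private final byte[] histories;

    EventHistories(int componentCount) {
        histories = new byte[componentCount];
    }

    /** Records a new event: bit 0 is the newest, the oldest bit is masked off. */
    void record(int component, boolean active) {
        int h = histories[component] & 0xFF;
        h = ((h << 1) | (active ? 1 : 0)) & MASK;
        histories[component] = (byte) h;
    }

    /** Returns the event recorded `age` steps ago (0 = most recent). */
    boolean get(int component, int age) {
        return ((histories[component] >> age) & 1) != 0;
    }
}
```

For 100K components this is a single array of about 100 KB, with no per-history object header.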
This is a weirdly specific problem and I'd be really glad about any suggestions.
Setting aside the bit manipulations which I'm sure you'll conquer, please think how efficient is efficient enough.
Every instance of
class Foo {}
allocates 16 bytes. So if you were to introduce
class ComponentHistory {
private final int bits;
}
that's 20 bytes.
If you replace the int with byte, you're still at 20 bytes: byte type is padded to 4 bytes by JVM (at least).
If you define a global array of bits somewhere and refer to it from ComponentHistory, the reference itself is at least 4 bytes.
Basically, you can't win :)
But consider this: if you go with the simplest approach that you have already outlined, that produces simple readable code, your 100K component histories will take up 2MB of RAM - substantial savings from your current level of 1.5GB. Specifically, you've saved 1498MB.
Suppose you indeed invent a cumbersome yet working way of only storing 5 bits per history. You'd then need 100K x 5 bits = 500 Kbit, or roughly 62.5 KB, to store all histories. With the baseline of 1.5GB, your savings are now 1499.94MB. Savings improve by about 0.1%. Does that matter at all? More often than not, I'd prefer not to over-optimize here at the cost of simplicity.

Implementing a very efficient bit structure

I'm looking for a solution in pseudocode, Java, or JS for the following problem:
We need to implement an efficient bit structure to hold data for N bits (you could think of the bits as booleans as well, on/off).
We need to support the following methods:
init(n)
get(index)
set(index, True/False)
setAll(True/false)
I have a solution that is O(1) for everything except init, which is O(n). The idea is to create an array where each index stores the value of one bit. To support setAll, I also store a timestamp with each bit value, so that I know whether to take the value from the array or from the last setAll value. The O(n) in init is there because we need to go through the array to clear it; otherwise it will contain garbage, which can be ANYTHING. Now I was asked to find a solution where init is also O(1) (we can create an array, but we can't clear the garbage; the garbage might even look like valid data, which would make the solution wrong; we need a solution that works 100% of the time).
Update:
This is an algorithmic question and not a language-specific one. I encountered it as an interview question. Also, using an integer to represent the bit array is not good enough because of memory limits. I was tipped that it has something to do with some kind of smart handling of the garbage data in the array without cleaning it in init, using some kind of mechanism to not fail because of the garbage data in the array (but I'm not sure how).
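For reference, the timestamp idea described in the question can be sketched as below. This shows only the O(1) get/set/setAll part; the constructor is still O(n), because Java zero-fills new arrays. The class name StampedBits and the use of an int version counter (which would wrap around after about 2^32 setAll calls) are assumptions made for the sketch.

```java
final class StampedBits {
    private final boolean[] values;
    private final int[] stamps;   // version at which each bit was last written
    private boolean allValue;     // value passed to the most recent setAll
    private int version = 1;      // bumped by every setAll

    StampedBits(int n) {          // O(n): Java zero-fills new arrays
        values = new boolean[n];
        stamps = new int[n];
    }

    boolean get(int index) {
        // A bit written since the last setAll wins; otherwise use the setAll value.
        return stamps[index] == version ? values[index] : allValue;
    }

    void set(int index, boolean value) {
        values[index] = value;
        stamps[index] = version;
    }

    void setAll(boolean value) {
        allValue = value;
        version++;                // invalidates every earlier per-bit write in O(1)
    }
}
```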
Make a lazy data structure based on a hashmap (although a hashmap may sometimes have worse access time than O(1)) with 32-bit values (8-, 16-, or 64-bit ints are suitable too) for storage and an auxiliary field InitFlag.
To clear all, make an empty map with InitFlag = 0 (deleting the old map is the GC's work in Java, isn't it?).
To set all, make an empty map with InitFlag = 1.
When changing some bit, check whether the corresponding int key bitnum/32 exists. If yes, just change bit bitnum & 31 of its value. If not, and the bit value differs from InitFlag, create the key with a value based on InitFlag (all zeros or all ones) and change the needed bit.
When retrieving some bit, check whether the corresponding key exists. If yes, extract the bit; if not, return the InitFlag value.
SetAll(0): ifl = 0, map - {}
SetBit(35): ifl = 0, map - {1 : 0x08}
SetBit(32): ifl = 0, map - {1 : 0x09}
ClearBit(32): ifl = 0, map - {1 : 0x08}
ClearBit(1): do nothing, ifl = 0, map - {1 : 0x08}
GetBit(1): key=0 doesn't exist, return ifl=0
GetBit(35): key=1 exists, return (map[1]>>3)&1 = 1
SetAll(1): ifl = 1, map = {}
SetBit(35): do nothing
ClearBit(35): ifl = 1, map - {1 : 0xFFFFFFF7 = 0b...11110111}
and so on
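A minimal Java sketch of the lazy hashmap structure described above might look like this. The class name LazyBits and the field names are illustrative only, and the O(1) claims hold only to the extent that HashMap operations are O(1) on average.

```java
import java.util.HashMap;
import java.util.Map;

final class LazyBits {
    private Map<Integer, Integer> words = new HashMap<>();
    private boolean initFlag; // value of every bit whose 32-bit word is not stored

    void setAll(boolean value) {
        words = new HashMap<>(); // the old map is left to the garbage collector
        initFlag = value;
    }

    boolean get(int index) {
        Integer word = words.get(index >>> 5); // key = bitnum / 32
        if (word == null) {
            return initFlag;
        }
        return ((word >>> (index & 31)) & 1) != 0;
    }

    void set(int index, boolean value) {
        int key = index >>> 5;
        Integer word = words.get(key);
        if (word == null) {
            if (value == initFlag) {
                return; // the bit already has this value, nothing to store
            }
            word = initFlag ? -1 : 0; // all ones or all zeros
        }
        int bit = 1 << (index & 31);
        words.put(key, value ? (word | bit) : (word & ~bit));
    }
}
```

No init(n) is needed in this sketch, since the map starts empty and missing words are treated as filled with the InitFlag value.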
If this is a college/high-school computer science test or homework assignment question - I suspect they are trying to get you to use BOOLEAN BIT-WISE LOGIC - specifically, saving the bit inside of an int or a long. I suspect (but I'm not a mind-reader - and I could be wrong!) that using "Arrays" is exactly what your teacher would want you to avoid.
For instance - this quote is copied from Google's search results:
long: The long data type is a 64-bit two's complement integer. The
signed long has a minimum value of -2^63 and a maximum value of 2^63-1.
In Java SE 8 and later, you can use the long data type to represent an
unsigned 64-bit long, which has a minimum value of 0 and a maximum
value of 2^64-1
What that means is that a single long variable in Java could store 64 of your bit-wise values:
long storage = 0L;
// To read a bit, use bitwise AND ('&') with a mask and test for non-zero.
boolean result1 = (storage & 0b00000001L) != 0; // Gets the first bit in 'storage'
boolean result2 = (storage & 0b00000010L) != 0; // Gets the second
boolean result3 = (storage & 0b00000100L) != 0; // Gets the third
...
boolean result8 = (storage & 0b10000000L) != 0; // Gets the eighth bit.
I could write the entire thing for you, but I'm not 100% sure of your actual specifications - if you use a long, you can only store 64 separate binary values. If you want an arbitrary number of values, you would have to use as many 'long' as you need.
Here is an SO post about binary / boolean values:
Binary representation in Java
Here is a SO post about bit-shifting:
Java - Circular shift using bitwise operations
Again, it would be a job, and I'm not going to write the entire project. However, the get(int index) and set(int index, boolean val) methods would involve bit-wise shifting of the number 1.
long pos = 1L;
pos = pos << 5; // This acts as a mask 'pointing' at bit index 5 (the sixth bit) of the binary number.
boolean bit = (storage & pos) != 0; // This retrieves the value stored at that position.
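Putting the pieces of this answer together, get and set over "as many longs as you need" could be sketched as follows. The class name LongBits is hypothetical, and note that creating the array is O(n) in Java, so this does not address the O(1) init requirement from the question.

```java
final class LongBits {
    private final long[] words;

    LongBits(int n) {
        words = new long[(n + 63) / 64]; // one long per 64 bits, rounded up
    }

    boolean get(int index) {
        // Select the word with index / 64, then mask the bit with index % 64.
        return (words[index >>> 6] & (1L << (index & 63))) != 0;
    }

    void set(int index, boolean value) {
        long bit = 1L << (index & 63);
        if (value) {
            words[index >>> 6] |= bit;  // bitwise OR turns the bit on
        } else {
            words[index >>> 6] &= ~bit; // AND with the inverted mask turns it off
        }
    }
}
```

The standard java.util.BitSet class is built on the same long[] representation.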

Why it's impossible to create an array of MAX_INT size in Java?

I have read some answers to this question (Why I can't create an array with large size? and https://bugs.openjdk.java.net/browse/JDK-8029587) and I don't understand the following.
"In the GC code we pass around the size of objects in words as an int." As far as I know, the size of a word in the JVM is 4 bytes. According to this, if we pass around the size of a large long array (for example, MAX_INT - 5 elements) in words as an int, we must get an OutOfMemoryError with Requested array size exceeds VM limit, because the size is too large for an int even without the size of the header. So why do arrays of different types have the same limit on the maximum number of elements?
Only addressing the "why do arrays of different types have the same limit on the maximum number of elements?" part:
Because it doesn't matter too much in practical reality, but it allows the code implementing the JVM to be simpler.
When there is only one limit that is the same for all kinds of arrays, you can deal with all arrays using that one piece of code, instead of having a lot of type-specific code.
And given that the people who need "large" arrays can still create them, and only those who need really, really large arrays are impacted, why spend that effort?
The answer is in the jdk sources as far as I can tell (I'm looking at jdk-9); also after writing it I am not sure if it should be a comment instead (and if it answers your question), but it's too long for a comment...
First the error is thrown from hotspot/src/share/vm/oops/arrayKlass.cpp here:
if (length > arrayOopDesc::max_array_length(T_ARRAY)) {
report_java_out_of_memory("Requested array size exceeds VM limit");
....
}
Now, T_ARRAY is actually an enum of type BasicType that looks like this:
public static final BasicType T_ARRAY = new BasicType(tArray);
// tArray is an int with value = 13
That is the first indication that when computing the maximum size, jdk does not care what that array will hold (the T_ARRAY does not specify what types will that array hold).
Now the method that actually validates the maximum array size looks like this:
static int32_t max_array_length(BasicType type) {
assert(type >= 0 && type < T_CONFLICT, "wrong type");
assert(type2aelembytes(type) != 0, "wrong type");
const size_t max_element_words_per_size_t =
align_size_down((SIZE_MAX/HeapWordSize - header_size(type)), MinObjAlignment);
const size_t max_elements_per_size_t =
HeapWordSize * max_element_words_per_size_t / type2aelembytes(type);
if ((size_t)max_jint < max_elements_per_size_t) {
// It should be ok to return max_jint here, but parts of the code
// (CollectedHeap, Klass::oop_oop_iterate(), and more) uses an int for
// passing around the size (in words) of an object. So, we need to avoid
// overflowing an int when we add the header. See CRs 4718400 and 7110613.
return align_size_down(max_jint - header_size(type), MinObjAlignment);
}
return (int32_t)max_elements_per_size_t;
}
I did not dive too deep into the code, but it is based on HeapWordSize, which is at least 8 bytes. Here is a good reference (I tried to look it up in the code itself, but there are too many references to it).

Cap'n Proto - Finding Message Size in Java

I am using a TCP Client/Server to send Cap'n Proto messages from C++ to Java.
Sometimes the receiving buffer may be overfilled or underfilled and to handle these cases we need to know the message size.
When I check the size of the buffer in Java I get 208 bytes, however calling
MyModel.MyMessage.STRUCT_SIZE.total()
returns 4 (not sure what unit of measure is being used here).
I notice that 4 divides into 208, 52 times. But I don't know of a significant conversion factor using 52.
How do I check the message size in Java?
MyMessage.STRUCT_SIZE represents the constant size of that struct itself (measured in 8-byte words), but if the struct contains non-trivial fields (like Text, Data, List, or other structs) then those take up space too, and the amount of space they take is not constant (e.g. Text will take space according to how long the string is).
Generally you should try to let Cap'n Proto directly write to / read from the appropriate ByteChannels, so that you don't have to keep track of sizes yourself. However, if you really must compute the size of a message ahead of time, you could do so with something like:
ByteBuffer[] segments = message.getSegmentsForOutput();
int total = (segments.length / 2 + 1) * 8; // segment table
for (ByteBuffer segment: segments) {
total += segment.remaining();
}
// now `total` is the total number of bytes that will be
// written when the message is serialized.
On the C++ side, you can use capnp::computeSerializedSizeInWords() from serialize.h (and multiply by 8).
But again, you really should structure your code to avoid this, by using the methods of org.capnproto.Serialize with streaming I/O.

How to parse bit fields from a byte array in Java?

I've been given the arduous task of parsing some incoming UDP packets from a source into an appropriate Java representation. The kicker is the data held within the packets are not byte aligned. In an effort to make the protocol as tight as possible, there are a number of bit fields indicating the presence or absence of data fields.
For example, at bit index 34 you may find a 24 bit field that should be converted to a float. Bit index 110 may be a flag indicating that the next 3 fields are each 5 and 6 bit values containing the hour, minute, and second of the day. (These are just made up by me but are representative of what the spec says). These packets are probably a few hundred bits long.
The spec is not likely to change, but it's completely possible that I'll be asked to decode other similar packet formats.
I can certainly bit shift and mask as needed, but I'm worried about ending up in bit shift hell and then drowning in it when more packet formats are discovered.
I'd love to hear any suggestions on best practices or Java libraries that may make the task more manageable.
Decoding QR codes is much the same exercise in reading a couple bits at a time, completely unaligned. Here's what I wrote to handle it -- maybe you can just reuse this.
http://code.google.com/p/zxing/source/browse/trunk/core/src/com/google/zxing/common/BitSource.java
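The general idea of reading unaligned fields a few bits at a time can be sketched with a small reader like the one below. This is not the ZXing BitSource code; the class name BitReader, the MSB-first bit order, and the 32-bit limit per read are assumptions chosen for illustration, and the actual bit order would have to come from the packet spec.

```java
final class BitReader {
    private final byte[] data;
    private int bitPos; // absolute index of the next unread bit

    BitReader(byte[] data) {
        this.data = data;
    }

    /** Moves to an absolute bit index, e.g. 34 for a field that starts at bit 34. */
    void seek(int bitIndex) {
        bitPos = bitIndex;
    }

    /** Reads `count` bits (1..32), most significant bit first, as an unsigned value. */
    long readBits(int count) {
        if (count < 1 || count > 32 || bitPos + count > data.length * 8) {
            throw new IllegalArgumentException("bad bit count or read past end of data");
        }
        long value = 0;
        for (int i = 0; i < count; i++, bitPos++) {
            int bit = (data[bitPos >>> 3] >> (7 - (bitPos & 7))) & 1;
            value = (value << 1) | bit;
        }
        return value;
    }

    /** Reads a single presence flag. */
    boolean readFlag() {
        return readBits(1) == 1;
    }
}
```

Each packet format can then be decoded as a flat sequence of readFlag and readBits calls that mirrors the spec, which keeps the shifting and masking in one place.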
For such cases I have developed the JBBP library, which is available in Maven Central.
For instance, parsing a file into bits and printing the parsed values looks like this:
public static final void main(final String ... args) throws Exception {
try (InputStream inStream = ClassLoader.getSystemClassLoader().getResourceAsStream("somefile.txt")) {
class Bits { @Bin(type = BinType.BIT_ARRAY) byte[] bits; }
for (final byte b : JBBPParser.prepare("bit [_] bits;", JBBPBitOrder.MSB0).parse(inStream).mapTo(Bits.class).bits) {
System.out.print(b != 0 ? "1" : "0");
}
}
}
